
Commit 62ce7f2

feat: add images
1 parent 8c63a19 commit 62ce7f2

File tree

1 file changed: +61 -6 lines changed


sources/academy/webscraping/scraping_basics_python/13_platform.md

Lines changed: 61 additions & 6 deletions
@@ -80,7 +80,9 @@ The file contains a single asynchronous function, `main()`. At the beginning, it

 Every program that runs on the Apify platform first needs to be packaged as a so-called Actor—a standardized container with designated places for input and output. Crawlee scrapers automatically connect their default dataset to the Actor output, but input must be handled explicitly in the code.

-We'll now adjust the template so it runs our program for watching prices. As a first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with the [final code](./12_framework.md#logging) from the previous lesson:
+![The expected file structure](./images/actor-file-structure.png)
+
+We'll now adjust the template so that it runs our program for watching prices. As the first step, we'll create a new empty file, `crawler.py`, inside the `warehouse-watchdog/src` directory. Then, we'll fill this file with the final, unchanged code from the previous lesson:

 ```py title=warehouse-watchdog/src/crawler.py
 import asyncio
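
The hunk above notes that input must be handled explicitly in an Actor's code. As context, here is a minimal sketch of reading Actor input with the Apify Python SDK; the `start_url` field is hypothetical and not part of the lesson's scraper, which takes no input:

```py
import asyncio

from apify import Actor


async def main():
    async with Actor:
        # Returns the JSON input the run was started with (None if empty)
        actor_input = await Actor.get_input() or {}
        # "start_url" is a hypothetical input field, for illustration only
        start_url = actor_input.get("start_url", "https://example.com")
        Actor.log.info(f"Would start scraping at {start_url}")


if __name__ == "__main__":
    asyncio.run(main())
```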
@@ -95,7 +97,50 @@ async def main():
         context.log.info("Looking for product detail pages")
         await context.enqueue_links(selector=".product-list a.product-item__title", label="DETAIL")

-    ...
+    @crawler.router.handler("DETAIL")
+    async def handle_detail(context):
+        context.log.info(f"Product detail page: {context.request.url}")
+        price_text = (
+            context.soup
+            .select_one(".product-form__info-content .price")
+            .contents[-1]
+            .strip()
+            .replace("$", "")
+            .replace(",", "")
+        )
+        item = {
+            "url": context.request.url,
+            "title": context.soup.select_one(".product-meta__title").text.strip(),
+            "vendor": context.soup.select_one(".product-meta__vendor").text.strip(),
+            "price": Decimal(price_text),
+            "variant_name": None,
+        }
+        if variants := context.soup.select(".product-form__option.no-js option"):
+            for variant in variants:
+                context.log.info("Saving a product variant")
+                await context.push_data(item | parse_variant(variant))
+        else:
+            context.log.info("Saving a product")
+            await context.push_data(item)
+
+    await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
+
+    crawler.log.info("Exporting data")
+    await crawler.export_data_json(path='dataset.json', ensure_ascii=False, indent=2)
+    await crawler.export_data_csv(path='dataset.csv')
+
+def parse_variant(variant):
+    text = variant.text.strip()
+    name, price_text = text.split(" - ")
+    price = Decimal(
+        price_text
+        .replace("$", "")
+        .replace(",", "")
+    )
+    return {"variant_name": name, "price": price}
+
+if __name__ == '__main__':
+    asyncio.run(main())
 ```

 Now, let's replace the contents of `warehouse-watchdog/src/main.py` with this:
@@ -109,7 +154,7 @@ async def main():
         await crawl()
 ```

-We import our program as a function and await the result inside the Actor block. Unlike the sample scraper, our program doesn't expect any input data, so we can delete the code handling that part.
+We import our scraper as a function and await the result inside the Actor block. Unlike the sample scraper, the one we made in the previous lesson doesn't expect any input data, so we can omit the code that handles that part.

 Next, we'll change to the `warehouse-watchdog` directory in our terminal and verify that everything works locally before deploying the project to the cloud:

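Only the `await crawl()` line of the adjusted `main.py` appears in the hunk above. For reference, the whole file might look roughly like this sketch; the import alias `crawl` is an assumption, since the diff doesn't show that line:

```py
from apify import Actor

# Assumed import: the diff doesn't show how crawler.py's main() is brought in
from .crawler import main as crawl


async def main():
    # The Actor block wires the run into the platform's storage and logging
    async with Actor:
        await crawl()
```
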
@@ -203,11 +248,15 @@ Actor build detail https://console.apify.com/actors/a123bCDefghiJkLMN#/builds/0.
 ? Do you want to open the Actor detail in your browser? (Y/n)
 ```

-After agreeing to open the Actor details in our browser, assuming we're logged in, we'll see an option to **Start Actor**. Clicking it opens the execution settings. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper is running in the cloud.
+After opening the link in our browser, assuming we're logged in, we'll see the **Source** screen on the Actor's detail page. We'll go to the **Input** tab of that screen. We won't change anything—just hit **Start**, and we should see logs similar to what we see locally, but this time our scraper will be running in the cloud.
+
+![Actor's detail page, screen Source, tab Input](./images/actor-input.png)

 When the run finishes, the interface will turn green. On the **Output** tab, we can preview the results as a table or JSON. We can even export the data to formats like CSV, XML, Excel, RSS, and more.

-:::note Accessing data programmatically
+![Actor's detail page, screen Source, tab Output](./images/actor-output.png)
+
+:::info Accessing data programmatically

 You don't need to click buttons to download the data. You can also retrieve it using [Apify's API](https://docs.apify.com/api/v2/dataset-items-get), the [`apify datasets`](https://docs.apify.com/cli/docs/reference#datasets) CLI command, or the [`apify`](https://docs.apify.com/api/client/python/docs/examples/retrieve-actor-data) Python SDK.

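To make the callout concrete, here is a minimal sketch of pulling the dataset with the `apify-client` Python package; the token and dataset ID are placeholders to be taken from the Apify console:

```py
from apify_client import ApifyClient

# Placeholder credentials; real values come from the Apify console
client = ApifyClient("MY-APIFY-TOKEN")

# list_items() returns a page object; .items holds the scraped records
for item in client.dataset("DATASET-ID").list_items().items:
    print(item["title"], item["price"])
```
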
@@ -219,6 +268,8 @@ Now that our scraper is deployed, let's automate its execution. In the Apify web

 From now on, the Actor will execute daily. We can inspect each run, view logs, check collected data, see stats, monitor charts, and even set up alerts.

+![Schedule detail page](./images/actor-schedule.png)
+
 ## Adding support for proxies

 If monitoring shows that our scraper frequently fails to reach the Warehouse Shop website, it's likely being blocked. To avoid this, we can configure proxies so our requests come from different locations, reducing the chances of detection and blocking.
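
The code changes for proxies appear further down in the diffed file. Conceptually, wiring platform proxies into the crawler looks roughly like this sketch, assuming the Apify SDK's `Actor.create_proxy_configuration()` and Crawlee's `proxy_configuration` parameter; import paths may differ between Crawlee versions:

```py
from apify import Actor
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main():
    async with Actor:
        # Builds a proxy configuration from the Actor's proxy input, if any
        proxy_config = await Actor.create_proxy_configuration()
        crawler = BeautifulSoupCrawler(proxy_configuration=proxy_config)
        await crawler.run(["https://warehouse-theme-metal.myshopify.com/collections/sales"])
```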
@@ -331,7 +382,11 @@ Run: Building Actor warehouse-watchdog
 ? Do you want to open the Actor detail in your browser? (Y/n)
 ```

-Back in the Apify console, go to the **Source** screen and switch to the **Input** tab. You'll see the new **Proxy config** option, which defaults to **Datacenter - Automatic**. Leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform:
+Back in the Apify console, go to the **Source** screen and switch to the **Input** tab. You'll see the new **Proxy config** option, which defaults to **Datacenter - Automatic**.
+
+![Actor's detail page, screen Source, tab Input with proxies](./images/actor-input-proxies.png)
+
+Leave it as is and click **Start**. This time, the logs should show `Using proxy: yes`, as the scraper uses proxies provided by the platform:

 ```text
 (timestamp) ACTOR: Pulling Docker image of build o6vHvr5KwA1sGNxP0 from repository.
