- Within our new **numbers** property, there are two more fields we must specify. Firstly, we must let the platform know that we're expecting an array of numbers with the **type** field. Then, we should also instruct Apify on which UI component to render for this input property. In our case, we have an array of numbers, which means we should use the **json** editor type that we discovered in the ["array" section](/platform/actors/development/actor-definition/input-schema#array) of the input schema documentation. We could also use **stringList**, but then we'd have to parse out the numbers from the strings.
+ Within our new **numbers** property, there are two more fields we must specify. Firstly, we must let the platform know that we're expecting an array of numbers with the **type** field. Then, we should also instruct Apify on which UI component to render for this input property. In our case, we have an array of numbers, which means we should use the **json** editor type that we discovered in the ["array" section](/platform/actors/development/actor-definition/input-schema/specification/v1#array) of the input schema documentation. We could also use **stringList**, but then we'd have to parse out the numbers from the strings.
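As a sketch, the **numbers** property described in the changed line might look like this inside **INPUT_SCHEMA.json** — the **title**, **description**, and **prefill** values here are illustrative, only **type** and **editor** come from the text:

```json
"numbers": {
    "title": "Numbers",
    "type": "array",
    "description": "An array of numbers to process.",
    "editor": "json",
    "prefill": [5, 10, 15]
}
```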
`sources/academy/platform/expert_scraping_with_apify/solutions/handling_migrations.md` (1 addition, 1 deletion)
@@ -231,7 +231,7 @@ That's everything! Now, even if the Actor migrates (or is gracefully aborted and
**A:** It's not best to use this option by default. If it fails, there must be a reason, which would need to be thought through first - meaning that the edge case of failing should be handled when resurrecting the Actor. The state should be persisted beforehand.
- **Q: Migrations happen randomly, but by [aborting gracefully](/platform/actors/running#aborting-runs), you can simulate a similar situation. Try this out on the platform and observe what happens. What changes occur, and what remains the same for the restarted Actor's run?**
+ **Q: Migrations happen randomly, but by [aborting gracefully](/platform/actors/running/runs-and-builds#aborting-runs), you can simulate a similar situation. Try this out on the platform and observe what happens. What changes occur, and what remains the same for the restarted Actor's run?**
**A:** After aborting or throwing an error mid-process, it manages to start back from where it was upon resurrection.
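The persist-then-resume behavior in that answer can be sketched without the Apify SDK; this is a minimal, assumed illustration (an in-memory map stands in for the key-value store, and all function names are hypothetical, not SDK API):

```javascript
// Checkpoint/resume sketch: persist progress before a migration or abort,
// then continue from the saved state after resurrection.
// In a real Actor, the state would live in the key-value store instead of a Map.
const store = new Map();

async function saveState(state) {
    store.set('STATE', JSON.stringify(state));
}

async function loadState() {
    const raw = store.get('STATE');
    return raw ? JSON.parse(raw) : { processed: 0 };
}

async function run(items) {
    const state = await loadState();
    // Resume from the last checkpoint instead of index 0.
    for (let i = state.processed; i < items.length; i++) {
        // ...process items[i]...
        state.processed = i + 1;
        await saveState(state); // checkpoint after each item
    }
    return state.processed;
}
```

A second invocation of `run` with the same items starts from the saved index, which is exactly what "start back from where it was" means above.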
`sources/academy/platform/getting_started/inputs_outputs.md` (2 additions, 2 deletions)
@@ -65,7 +65,7 @@ Then, replace everything in **INPUT_SCHEMA.json** with this:
}
```
- > If you're interested in learning more about how the code works, and what the **INPUT_SCHEMA.json** means, read about [inputs](/sdk/js/docs/examples/accept-user-input) and [adding data to a dataset](/sdk/js/docs/examples/add-data-to-dataset) in the Apify SDK documentation, and refer to the [input schema docs](/platform/actors/development/actor-definition/input-schema#integer).
+ > If you're interested in learning more about how the code works, and what the **INPUT_SCHEMA.json** means, read about [inputs](/sdk/js/docs/examples/accept-user-input) and [adding data to a dataset](/sdk/js/docs/examples/add-data-to-dataset) in the Apify SDK documentation, and refer to the [input schema docs](/platform/actors/development/actor-definition/input-schema/specification/v1#integer).
Finally, **Save** and **Build** the Actor just as you did in the previous lesson.
@@ -89,7 +89,7 @@ On the results tab, there are a whole lot of options for which format to view/do
There's our solution! Did it work for you as well? Now, we can download the data right from the results tab to be used elsewhere, or even programmatically retrieve it by using [Apify's API](/api/v2) (we'll be discussing how to do this in the next lesson).
- It's important to note that the default dataset of the Actor, which we pushed our solution to, will be retained for 7 days. If we wanted the data to be retained for an indefinite period of time, we'd have to use a named dataset. For more information about named storages vs unnamed storages, read a bit about [data retention on the Apify platform](/platform/storage#data-retention).
+ It's important to note that the default dataset of the Actor, which we pushed our solution to, will be retained for 7 days. If we wanted the data to be retained for an indefinite period of time, we'd have to use a named dataset. For more information about named storages vs unnamed storages, read a bit about [data retention on the Apify platform](/platform/storage/usage#data-retention).
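Retrieving dataset items programmatically, as the surrounding text mentions, goes through Apify's API dataset-items endpoint; here is a small sketch that builds such a URL (the dataset ID is a made-up placeholder):

```javascript
// Build the URL for Apify's "Get dataset items" endpoint.
// The `clean` and `format` query params mirror the options on the Dataset tab.
function datasetItemsUrl(datasetId, { clean = true, format = 'json' } = {}) {
    const params = new URLSearchParams({ clean: String(clean), format });
    return `https://api.apify.com/v2/datasets/${datasetId}/items?${params}`;
}

// Example with a placeholder dataset ID:
const url = datasetItemsUrl('my-dataset-id');
// → https://api.apify.com/v2/datasets/my-dataset-id/items?clean=true&format=json
```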
@@ -79,7 +79,7 @@ async function pageFunction(context) {
}
```
- ### [](#description)Description
+ ### Description
Getting the Actor's description is a little more involved, but still pretty straightforward. We can't just simply search for a `<p>` tag, because
there's a lot of them in the page. We need to narrow our search down a little. Using the DevTools we find that the Actor description is nested within
@@ -98,7 +98,7 @@ async function pageFunction(context) {
}
```
- ### [](#modified-date)Modified date
+ ### Modified date
The DevTools tell us that the `modifiedDate` can be found in a `<time>` element.
@@ -126,7 +126,7 @@ But we would much rather see a readable date in our results, not a unix timestam
constructor will not accept a `string`, so we cast the `string` to a `number` using the `Number()` function before actually calling `new Date()`.
Phew!
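The cast-then-construct step described above is plain JavaScript; a quick sketch (the sample timestamp is invented for illustration):

```javascript
// new Date() won't accept a numeric string holding a unix timestamp
// in milliseconds, so we cast it to a number with Number() first.
const modifiedTimestamp = '1562672082000'; // hypothetical value scraped from a <time> element
const modifiedDate = new Date(Number(modifiedTimestamp));

// modifiedDate.toISOString() now yields a readable date instead of a raw timestamp.
console.log(modifiedDate.toISOString());
```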
- ### [](#run-count)Run count
+ ### Run count
And so we're finishing up with the `runCount`. There's no specific element like `<time>`, so we need to create
a complex selector and then do a transformation on the result.
@@ -165,7 +165,7 @@ using a regular expression, but its type is still a `string`, so we finally conv
>
> This will give us a string (e.g. `'1234567'`) that can be converted via `Number` function.
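That extract-and-convert transformation is plain JavaScript; a tiny sketch with an assumed raw text (the string below is illustrative, not scraped from the real page):

```javascript
// Hypothetical raw text around the run count on the page.
const rawRunCountText = 'Used 1,234,567 times';

// Pull out the digits-and-commas part with a regular expression,
// strip the thousands separators, then convert the string to a number.
const match = rawRunCountText.match(/[\d,]+/);
const runCount = Number(match[0].replace(/,/g, ''));

console.log(runCount); // 1234567
```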
- ### [](#wrapping-it-up)Wrapping it up
+ ### Wrapping it up
And there we have it! All the data we needed in a single object. For the sake of completeness, let's add
the properties we parsed from the URL earlier and we're good to go.
@@ -243,13 +243,13 @@ async function pageFunction(context) {
}
```
- ### [](#test-run)Test run
+ ### Test run
As always, try hitting that **Save & Run** button and visit
the **Dataset** preview of clean items. You should see a nice table of all the attributes correctly scraped.
You nailed it!
- ## [](#pagination)Pagination
+ ## Pagination
Pagination is just a term that represents "going to the next page of results". You may have noticed that we did not
actually scrape all the Actors, just the first page of results. That's because to load the rest of the Actors,
@@ -265,7 +265,7 @@ with Cheerio? We don't have a browser to do it and we only have the HTML of the
answer is that we can't click a button. Does that mean that we cannot get the data at all? Usually not,
but it requires some clever DevTools-Fu.
- ### [](#analyzing-the-page)Analyzing the page
+ ### Analyzing the page
While with Web Scraper and **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)), we could get away with simply clicking a button,
with Cheerio Scraper we need to dig a little deeper into the page's architecture. For this, we will use
@@ -281,7 +281,7 @@ Then we click the **Show more** button and wait for incoming requests to appear
Now, this is interesting. It seems that we've only received two images after clicking the button and no additional
data. This means that the data about Actors must already be available in the page and the **Show more** button only displays it. This is good news.
- ### [](#finding-the-actors)Finding the Actors
+ ### Finding the Actors
Now that we know the information we seek is already in the page, we just need to find it. The first Actor in the store
is Web Scraper, so let's try using the search tool in the **Elements** tab to find some reference to it. The first
@@ -310,7 +310,7 @@ so you might already be wondering, can I just make one request to the store to g
and then parse it out and be done with it in a single request? Yes you can! And that's the power
of clever page analysis.
- ### [](#using-the-data-to-enqueue-all-actor-details)Using the data to enqueue all Actor details
+ ### Using the data to enqueue all Actor details
We don't really need to go to all the Actor details now, but for the sake of practice, let's imagine we only found
Actor names such as `cheerio-scraper` and their owners, such as `apify` in the data. We will use this information
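The enqueueing idea can be sketched in plain JavaScript. The owner/name pairs come from the text above, and the URL pattern follows the store's `https://apify.com/{owner}/{name}` convention; the `userData.label` routing field is an assumed illustration:

```javascript
// Hypothetical data parsed out of the page: owner/name pairs of Actors.
const actors = [
    { owner: 'apify', name: 'cheerio-scraper' },
    { owner: 'apify', name: 'web-scraper' },
];

// Build a detail-page URL for each Actor so the requests can be enqueued.
const requests = actors.map(({ owner, name }) => ({
    url: `https://apify.com/${owner}/${name}`,
    userData: { label: 'DETAIL' }, // a label lets the pageFunction route each request
}));

console.log(requests[0].url); // https://apify.com/apify/cheerio-scraper
```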
@@ -343,7 +343,7 @@ how to route those requests.
> If you're wondering how we know the structure of the URL, see the [Getting started
with Apify Scrapers](./getting_started.md) tutorial again.
- ### [](#plugging-it-into-the-page-function)Plugging it into the Page function
+ ### Plugging it into the Page function
We've got the general algorithm ready, so all that's left is to integrate it into our earlier `pageFunction`.
Remember the `// Do some stuff later` comment? Let's replace it.
@@ -412,13 +412,13 @@ to get all results with Cheerio only and other times it takes hours of research.
the right scraper for your job. But don't get discouraged. Often times, the only thing you will ever need is to
define a correct Pseudo URL. Do your research first before giving up on Cheerio Scraper.
- ## [](#downloading-our-scraped-data)Downloading the scraped data
+ ## Downloading the scraped data
You already know the **Dataset** tab of the run console since this is where we've always previewed our data. Notice the row of data formats such as JSON, CSV, and Excel. Below it are options for viewing and downloading the data. Go ahead and try it.
> If you prefer working with an API, you can find the example endpoint under the API tab: **Get dataset items**.
- ### [](#clean-items)Clean items
+ ### Clean items
You can view and download your data without modifications, or you can choose to only get **clean** items. Data that aren't cleaned include a record
for each `pageFunction` invocation, even if you did not return any results. The record also includes hidden fields
@@ -428,7 +428,7 @@ Clean items, on the other hand, include only the data you returned from the `pag
To control this, open the **Advanced options** view on the **Dataset** tab.
- ## [](#bonus-making-your-code-neater)Bonus: Making your code neater
+ ## Bonus: Making your code neater
You may have noticed that the `pageFunction` gets quite bulky. To make better sense of your code and have an easier
time maintaining or extending your task, feel free to define other functions inside the `pageFunction`
@@ -496,11 +496,11 @@ async function pageFunction(context) {
> If you're confused by the functions being declared below their executions, it's called hoisting and it's a feature
of JavaScript. It helps you put what matters on top, if you so desire.
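Hoisting, mentioned in that note, is a plain JavaScript feature; a minimal sketch:

```javascript
// Function declarations are hoisted, so they can be called before
// they appear in the source — the "what" can sit above the "how".
const result = add(2, 3);

function add(a, b) {
    return a + b;
}

console.log(result); // 5
```

Note this only works for `function` declarations, not for function expressions assigned to `const` or `let`.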
- ## [](#final-word)Final word
+ ## Final word
Thank you for reading this whole tutorial! Really! It's important to us that our users have the best information available to them so that they can use Apify easily and effectively. We're glad that you made it all the way here and congratulations on creating your first scraping task. We hope that you liked the tutorial and if there's anything you'd like to ask, [join us on Discord](https://discord.gg/jyEM2PRvMU)!
- ## [](#whats-next)What's next
+ ## What's next
* Check out the [Apify SDK](https://sdk.apify.com/) and its [Getting started](https://sdk.apify.com/docs/guides/getting-started) tutorial if you'd like to try building your own Actors. It's a bit more complex and involved than writing a simple `pageFunction`, but it allows you to fine-tune all the details of your scraper to your liking.
* [Take a deep dive into Actors](/platform/actors), from how they work to [publishing](/platform/actors/publishing) them in Apify Store, and even [making money](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) on Actors.