- Whenever you build an Actor, think back to the original request or idea and the use case (the user need) it should solve. Take notes and share them with Apify, so we can help you write a blog post supporting your Actor with more information, a more detailed explanation, and better SEO.
- Consider adding a video, images, and screenshots to your README to break up the text.
- This is an example of an Actor with a README that corresponds well to the guidelines below:
sources/academy/tutorials/node_js/handle_blocked_requests_puppeteer.md (1 addition, 5 deletions)
@@ -54,11 +54,7 @@ const crawler = new PuppeteerCrawler({
});
```

-<<<<<<< HEAD
-It is really up to a developer to spot if something is wrong with his request. A website can interfere with your crawling in [many ways](https://kb.apify.com/tips-and-tricks/several-tips-how-to-bypass-website-anti-scraping-protections). Page loading can be cancelled right away, it can timeout, the page can display a captcha, some error or warning message, or the data may be missing or corrupted. The developer can then choose if he will try to handle these problems in the code or focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error.
-=======
-It is really up to a developer to spot if something is wrong with his request. A website can interfere with your crawling in [many ways](https://docs.apify.com/academy/anti-scraping). Page loading can be cancelled right away, it can timeout, the page can display a captcha, some error or warning message, or the data may be just missing or corrupted. The developer can then choose if he will try to handle these problems in the code or just focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error.
->>>>>>> 3aee8e01 (fix: avoid permanent redirects)
+It is really up to the developer to spot that something is wrong with a request. A website can interfere with your crawling in [many ways](https://docs.apify.com/academy/anti-scraping). Page loading can be cancelled right away, it can time out, the page can display a captcha or some error or warning message, or the data may be missing or corrupted. The developer can then choose whether to handle these problems in code or to focus on receiving the proper data. Either way, if the request went wrong, you should throw a proper error.
Now that we know when a request is blocked, we can use the retire() function and continue crawling with a new proxy. Google is one of the most popular targets for scrapers, so let's code a Google search crawler. Google's two main blocking mechanisms are to display its (in)famous 'sorry' captcha or to not load the page at all, so we will focus on covering these two cases.
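To make the blocked-request handling concrete, here is a minimal sketch of the pattern this tutorial describes. It assumes the legacy Apify SDK (v0.x), where the handlePageFunction context exposes a puppeteerPool with the retire() method; the 'sorry'-page check is an illustrative heuristic, not the tutorial's exact code.

```js
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [{ url: 'https://www.google.com/search?q=apify' }],
    });
    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        launchPuppeteerOptions: { useApifyProxy: true },
        handlePageFunction: async ({ request, page, puppeteerPool }) => {
            // Google redirects blocked clients to its 'sorry' captcha page.
            if (page.url().includes('google.com/sorry')) {
                // Retire the whole browser so retries get a fresh proxy session.
                await puppeteerPool.retire(page.browser());
                throw new Error(`Request ${request.url} was blocked by the 'sorry' captcha`);
            }
            // ...otherwise extract the search results here...
        },
    });

    await crawler.run();
});
```

Throwing (rather than silently returning) matters: it tells the crawler the request failed, so it gets retried with the new browser and proxy.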
sources/academy/tutorials/node_js/scraping_from_sitemaps.md (1 addition, 5 deletions)
@@ -13,11 +13,7 @@ import Example from '!!raw-loader!roa-loader!./scraping_from_sitemaps.js';

---

-<<<<<<< HEAD
-Let's say we want to scrape a database of craft beers ([brewbound.com](https://www.brewbound.com/)) before summer starts. If we are lucky, the website will contain a sitemap at [www.brewbound.com/sitemap.xml](https://www.brewbound.com/sitemap.xml).
-=======
-Let's say we want to scrape a database of craft beers ([brewbound.com](https://www.brewbound.com/)) before summer starts. If we are lucky, the website will contain a sitemap at [https://www.brewbound.com/sitemap.xml](https://www.brewbound.com/sitemap.xml).
->>>>>>> 3aee8e01 (fix: avoid permanent redirects)
+Let's say we want to scrape a database of craft beers ([brewbound.com](https://www.brewbound.com/)) before summer starts. If we are lucky, the website will contain a sitemap at [brewbound.com/sitemap.xml](https://www.brewbound.com/sitemap.xml).
> Check out [Sitemap Sniffer](https://apify.com/vaclavrut/sitemap-sniffer), which can discover sitemaps in hidden locations!
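For a picture of what sitemap scraping looks like in practice, here is a minimal sketch. It assumes Node.js 18+ (for the global fetch) and the cheerio package; it is not the tutorial's own example code.

```js
const cheerio = require('cheerio');

// Download a sitemap and return the page URLs listed in it.
async function getSitemapUrls(sitemapUrl) {
    const response = await fetch(sitemapUrl);
    const xml = await response.text();
    const $ = cheerio.load(xml, { xmlMode: true });
    // A sitemap is XML with one <url><loc>...</loc></url> entry per page.
    return $('url > loc').map((i, el) => $(el).text().trim()).get();
}

getSitemapUrls('https://www.brewbound.com/sitemap.xml')
    .then((urls) => console.log(`Found ${urls.length} URLs:`, urls.slice(0, 5)));
```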
sources/academy/webscraping/api_scraping/graphql_scraping/introspection.md (1 addition, 5 deletions)
@@ -148,11 +148,7 @@ Now that we have this visualization to work off of, it will be much easier to bu

In future lessons, we'll be building more complex queries using **dynamic variables** and advanced features such as **fragments**; however, for now let's get our feet wet by using the data we have from GraphQL Voyager to build a query.

-<<<<<<< HEAD
-Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out!
-=======
-Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://www.cheddar.com/). From each article, we'd like to fetch the **title** and the **publish date**. After just a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out!
->>>>>>> 3aee8e01 (fix: avoid permanent redirects)
+Right now, our goal is to fetch the 1000 most recent articles on [Cheddar](https://www.cheddar.com). From each article, we'd like to fetch the **title** and the **publish date**. After a bit of digging through the schema, we've come across the **media** field within the **organization** type, which has both **title** and **public_at** fields - seems to check out!
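To make that goal concrete, a first query assembled from those field names might look roughly like the sketch below. The argument used to limit results to 1000 articles and the exact nesting are assumptions for illustration; the lesson builds the real query from Cheddar's actual schema.

```graphql
query {
  organization {
    # "max" is a placeholder argument name; check the schema for the real one.
    media(max: 1000) {
      title
      public_at
    }
  }
}
```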
sources/platform/actors/development/deployment/continuous_integration.md (1 addition, 51 deletions)
@@ -25,7 +25,7 @@ This article focuses on GitHub, but [we also have a guide for Bitbucket](https:/

To set up automated builds and tests for your Actors you need to:

1. Create a GitHub repository for your Actor code.
-1. Get your Apify API token from the [Apify Console](https://console.apify.com/account#/integrations)
+1. Get your Apify API token from the [Apify Console](https://console.apify.com/settings/integrations)
@@ -75,11 +75,7 @@ To set up automated builds and tests for your Actors you need to:

</TabItem>

-<<<<<<< HEAD
<TabItem value="beta.yml" label="beta.yml">
-=======
-[Find the Bitbucket version here](https://help.apify.com/en/articles/6988586-setting-up-continuous-integration-for-apify-actors-on-bitbucket).
->>>>>>> 3aee8e01 (fix: avoid permanent redirects)

```yaml
name: Test and build beta version
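The hunk cuts the workflow off after its first line. For orientation, a beta-build workflow of this kind typically looks something like the sketch below; the trigger branch, Node version, test command, and secret names are assumptions, not the exact contents of the docs page.

```yaml
name: Test and build beta version
on:
  push:
    branches:
      - develop
jobs:
  test-and-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 18
      - run: npm install && npm test
      # On success, call the Apify build webhook with the stored secrets.
      - uses: distributhor/workflow-webhook@v1
        env:
          webhook_url: ${{ secrets.BETA_BUILD_URL }}
          webhook_secret: ${{ secrets.APIFY_TOKEN }}
```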
@@ -107,59 +103,13 @@ To set up automated builds and tests for your Actors you need to:

</TabItem>
</Tabs>

-<<<<<<< HEAD
## GitHub integration
-=======
-[Add the token to GitHub secrets](https://docs.github.com/en/actions/security-guides/using-secrets-in-github-actions#creating-encrypted-secrets-for-a-repository). Go to **your repo > Settings > Secrets > New repository secret**.
->>>>>>> 3aee8e01 (fix: avoid permanent redirects)

To set up automatic builds from GitHub:

-<<<<<<< HEAD
1. Go to your Actor's detail page and copy the Build Actor API endpoint URL from the API tab.
1. In your GitHub repository, go to Settings > Webhooks > Add webhook.
-## Set up automatic builds
-
-[//]: #(TODO: This duplicates somehow the above part)
-
-Once you have your [prerequisites](#prerequisites), you can start automating your builds. You can use [webhooks](https://en.wikipedia.org/wiki/Webhook) or the [Apify CLI](/cli/) ([described in our Bitbucket guide](https://help.apify.com/en/articles/6988586-setting-up-continuous-integration-for-apify-actors-on-bitbucket)) in your Git workflow.
-
-To use webhooks, you can use the [distributhor/workflow-webhook](https://github.com/distributhor/workflow-webhook) action, which uses the secrets described in the [prerequisites](#prerequisites) section.
-
-```yaml
-name: Build Actor
-- uses: distributhor/workflow-webhook@v1
-  env:
-    webhook_url: ${{ secrets.[VERSION]_BUILD_URL }}
-    webhook_secret: ${{ secrets.APIFY_TOKEN }}
-```
-
-You can find your builds under the Actor's **Builds** section.
-
-## [](#github-integration) Automatic builds from GitHub
-
-If the source code of an Actor is hosted in a [Git repository](#git-repository), it is possible to set up an integration so that the Actor is automatically rebuilt on every push to the Git repository. For that, you only need to set up a webhook in your Git source control system that will invoke the [Build Actor](/api/v2/#/reference/actors/build-collection/build-actor) API endpoint on every push.
-
-For repositories on GitHub, you can use the following steps. First, go to the Actor detail page, open the **API** tab, and copy the **Build Actor** API endpoint URL. It should look something like this:
-Then go to your GitHub repository, click **Settings**, select the **Webhooks** tab, and click **Add webhook**. Paste the API URL into the **Payload URL** as follows:
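For illustration, Build Actor API endpoint URLs follow this general pattern (the Actor ID, version, and token below are placeholders, not values recovered from this page):

```
https://api.apify.com/v2/acts/[ACTOR_ID]/builds?token=[YOUR_API_TOKEN]&version=[VERSION_NUMBER]
```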