Commit 549600b

Update documentation for CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS (#125)
1 parent 8752166 commit 549600b

1 file changed (+14, -2 lines changed)

service_config/crawler.md

Lines changed: 14 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -13,8 +13,9 @@
1313
- [CRAWLER\_SCANCODE\_PARALLELISM](#crawler_scancode_parallelism)
1414
- [CRAWLER\_SERVICE\_AUTH\_TOKEN](#crawler_service_auth_token)
1515
- [CRAWLER\_SERVICE\_URL](#crawler_service_url)
16-
- [CRAWLER\_STORE\_PROVIDER](#crawler_store_provider)
16+
- [CRAWLER\_STORE\_PROVIDER](#crawler_store_provider)
1717
- [CRAWLER\_WEBHOOK](#crawler_webhook)
18+
- [CRAWLER\_HARVESTS\_QUEUE\_VISIBILITY\_TIMEOUT\_SECONDS](#crawler_harvests_queue_visibility_timeout_seconds)
1819
- [Docker environmental variables](#docker-environmental-variables)
1920
- [DOCKER\_ENABLE\_CI](#docker_enable_ci)
2021
- [HARVEST\_AZBLOB](#harvest_azblob)
@@ -46,6 +47,7 @@ The environmental variables for the cdcrawler-dev App Service include:
4647
* CRAWLER_STORE_PROVIDER
4748
* CRAWLER_WEBHOOK_TOKEN
4849
* CRAWLER_WEBHOOK_URL
50+
* CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS
4951
* DOCKER_CUSTOM_IMAGE_NAME
5052
* DOCKER_ENABLE_CI
5153
* DOCKER_REGISTRY_SERVER_PASSWORD
@@ -131,7 +133,7 @@ It's unclear where this environmental variable is used within the crawler.
131133

132134
We use multiple services to store the crawler's harvests of license information.
133135

134-
If you look at the value of this environmental variable, you will see that it is **"cdDispatch+cd(azblob)+webhook"**
136+
If you look at the value of this environmental variable, you will see that it is **"cdDispatch+cd(azblob)+webhook"**. In the production crawler Dockerfile, it is configured as **"cdDispatch+cd(azblob)+azqueue"**.
135137

136138
These are used by [the crawler configuration code](https://github.com/clearlydefined/crawler/blob/32a0d6b59edfda5d3226c50680e4a8338af395cd/config/cdConfig.js).
137139

@@ -151,6 +153,10 @@ We use a few different "dispatchers" - which are used to fetch GitHub repos or P
151153

152154
cdDispatch refers to the generic base file that handles calls to the various dispatchers.
153155

156+
**azqueue**
157+
158+
This refers to an Azure Storage Queue used by the crawler to notify a service upon the completion of a tool’s processing. The default queue name is `harvests`. More details on the configuration can be found in the [cdConfig.js file](https://github.com/clearlydefined/crawler/blob/32a0d6b59edfda5d3226c50680e4a8338af395cd/config/cdConfig.js#L95).
159+
154160
### CRAWLER_WEBHOOK
155161

156162
These environmental variables define the URL of the ClearlyDefined service's webhook (this is what the crawler calls after it completes a harvest).
@@ -159,6 +165,12 @@ In Dev the webhook url is "https://dev-api.clearlydefined.io/webhook".
159165

160166
The token is what we use to authenticate to the API (so that only the crawler can call that part of the ClearlyDefined Service api)
161167

168+
### CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS
169+
170+
This environment variable is optional and specifically applies to the `azqueue` crawler store provider. It sets the visibility timeout, which determines how long messages remain hidden after being pushed onto the queue.
171+
172+
The default value is `0`. For production crawlers, this value is explicitly set to `300` seconds (5 minutes).
173+
162174
## Docker environmental variables
163175

164176
The Docker environmental variables define what container image is used for the Crawler, as well as what registry that image is kept in, and authentication info for the registry.
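
The documentation added in this change describes `CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS` as optional, defaulting to `0`, and relevant only when `azqueue` appears in `CRAWLER_STORE_PROVIDER`. A minimal sketch of that behavior follows. The helper names (`splitStoreProviders`, `readVisibilityTimeoutSeconds`, `harvestQueueSettings`) are hypothetical and are not taken from the crawler's `cdConfig.js`; splitting the provider string on `+` and rejecting non-integer values are assumptions made for illustration, and the crawler's actual handling may differ.

```javascript
// Hypothetical helpers for illustration; none of these names come from
// the crawler's cdConfig.js.

// Split a CRAWLER_STORE_PROVIDER value such as
// "cdDispatch+cd(azblob)+azqueue" into its individual provider names.
// Assumption: "+" is the only separator.
function splitStoreProviders(value) {
  return (value || '')
    .split('+')
    .map(name => name.trim())
    .filter(name => name.length > 0);
}

// Read the optional visibility timeout, falling back to the documented
// default of 0 when the variable is unset or empty.
function readVisibilityTimeoutSeconds(env) {
  const raw = env.CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS;
  if (raw === undefined || raw.trim() === '') return 0;
  const seconds = Number(raw);
  // Assumption: only non-negative whole seconds are accepted.
  if (!Number.isInteger(seconds) || seconds < 0) {
    throw new Error(`Invalid visibility timeout: ${raw}`);
  }
  return seconds;
}

// Per the documentation, the timeout applies only to the azqueue provider,
// so it is reported as 0 when azqueue is not configured.
function harvestQueueSettings(env) {
  const usesQueue = splitStoreProviders(env.CRAWLER_STORE_PROVIDER).includes('azqueue');
  return {
    usesQueue,
    visibilityTimeoutSeconds: usesQueue ? readVisibilityTimeoutSeconds(env) : 0,
  };
}

// Example with the production-style values mentioned in the documentation.
const prod = harvestQueueSettings({
  CRAWLER_STORE_PROVIDER: 'cdDispatch+cd(azblob)+azqueue',
  CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS: '300',
});
console.log(prod.usesQueue, prod.visibilityTimeoutSeconds);
```

Passing a plain object instead of `process.env` keeps the sketch easy to exercise; in a real Node.js process, `harvestQueueSettings(process.env)` would read the same variables from the environment. Note that a value such as `300 seconds` would be rejected here, since only the bare number is treated as valid.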
