Closed
Labels: bug, t-tooling
Description
Request deduplication does not always work in the Apify-Scrapy integration.
Reproduction
- Use the sample code from the Scrapy guide or the Scrapy template.
- Input:
{
    "allowedDomains": [
        "crawlee.dev"
    ],
    "proxyConfiguration": {
        "useApifyProxy": false
    },
    "startUrls": [
        {
            "url": "https://crawlee.dev/",
            "method": "GET"
        }
    ]
}
Observed behavior
- The start URL "https://crawlee.dev/" was crawled five times (five "is parsing" entries in the log).
- The URL "https://crawlee.dev/docs/examples" was crawled twice.
Logs:
2025-02-10T17:26:24.729Z ACTOR: Pulling Docker image of build 0VUi8LhZspd5TTEGF from repository.
2025-02-10T17:26:31.307Z ACTOR: Creating Docker container.
2025-02-10T17:26:32.118Z ACTOR: Starting Docker container.
2025-02-10T17:26:35.619Z [apify] INFO Initializing Actor...
2025-02-10T17:26:35.622Z [apify] INFO Initializing Actor... ({"message": "Initializing Actor..."})
2025-02-10T17:26:35.625Z [apify] INFO System info ({"apify_sdk_version": "2.2.2", "apify_client_version": "1.9.1", "crawlee_version": "0.5.4", "python_version": "3.12.8", "os": "linux"})
2025-02-10T17:26:35.628Z [apify] INFO System info ({"apify_sdk_version": "2.2.2", "apify_client_version": "1.9.1", "crawlee_version": "0.5.4", "python_version": "3.12.8", "os": "linux", "message": "System info"})
2025-02-10T17:26:35.731Z [scrapy.addons] INFO Enabled addons:
2025-02-10T17:26:35.734Z [] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:35.841Z [scrapy.middleware] INFO Enabled extensions:
2025-02-10T17:26:35.844Z ['scrapy.extensions.corestats.CoreStats',
2025-02-10T17:26:35.846Z 'scrapy.extensions.memusage.MemoryUsage',
2025-02-10T17:26:35.848Z 'scrapy.extensions.logstats.LogStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:35.851Z [scrapy.crawler] INFO Overridden settings:
2025-02-10T17:26:35.854Z {'BOT_NAME': 'titlebot',
2025-02-10T17:26:35.856Z 'DEPTH_LIMIT': 1,
2025-02-10T17:26:35.859Z 'LOG_LEVEL': 'INFO',
2025-02-10T17:26:35.862Z 'NEWSPIDER_MODULE': 'src.spiders',
2025-02-10T17:26:35.864Z 'ROBOTSTXT_OBEY': True,
2025-02-10T17:26:35.867Z 'SCHEDULER': 'apify.scrapy.scheduler.ApifyScheduler',
2025-02-10T17:26:35.869Z 'SPIDER_MODULES': ['src.spiders'],
2025-02-10T17:26:35.872Z 'TELNETCONSOLE_ENABLED': False,
2025-02-10T17:26:35.874Z 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2025-02-10T17:26:36.176Z [apify] INFO ApifyHttpProxyMiddleware is not going to be used. Actor input field "proxyConfiguration.useApifyProxy" is set to False.
2025-02-10T17:26:36.179Z [apify] INFO ApifyHttpProxyMiddleware is not going to be used. Actor input field "proxyConfiguration.useApifyProxy" is set to False. ({"message": "ApifyHttpProxyMiddleware is not going to be used. Actor input field \"proxyConfiguration.useApifyProxy\" is set to False."})
2025-02-10T17:26:36.182Z [scrapy.middleware] INFO Enabled downloader middlewares:
2025-02-10T17:26:36.185Z ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
2025-02-10T17:26:36.188Z 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
2025-02-10T17:26:36.191Z 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
2025-02-10T17:26:36.193Z 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
2025-02-10T17:26:36.195Z 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
2025-02-10T17:26:36.198Z 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
2025-02-10T17:26:36.200Z 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
2025-02-10T17:26:36.203Z 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
2025-02-10T17:26:36.205Z 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
2025-02-10T17:26:36.208Z 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
2025-02-10T17:26:36.211Z 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
2025-02-10T17:26:36.213Z 'scrapy.downloadermiddlewares.stats.DownloaderStats'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.215Z [scrapy.middleware] INFO Enabled spider middlewares:
2025-02-10T17:26:36.218Z ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
2025-02-10T17:26:36.220Z 'scrapy.spidermiddlewares.referer.RefererMiddleware',
2025-02-10T17:26:36.223Z 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
2025-02-10T17:26:36.225Z 'scrapy.spidermiddlewares.depth.DepthMiddleware'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.227Z [scrapy.middleware] INFO Enabled item pipelines:
2025-02-10T17:26:36.230Z ['apify.scrapy.pipelines.ActorDatasetPushPipeline'] ({"crawler": "<scrapy.crawler.Crawler object at 0x718670537e90>"})
2025-02-10T17:26:36.232Z [scrapy.core.engine] INFO Spider opened ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:36.343Z [scrapy.extensions.logstats] INFO Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:36.832Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:38.548Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:39.136Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/examples>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:39.909Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/blog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.332Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.527Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:40.539Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/python>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.042Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/next/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.311Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/api/core/changelog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.331Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.663Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.676Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.11/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.897Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/api/core>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:41.931Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.10/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.207Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.227Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.9/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.456Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.478Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.8/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.762Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.7/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:42.972Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.6/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.216Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.5/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.416Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.4/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.700Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.3/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:43.923Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.2/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.189Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.1/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.423Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/3.0/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.638Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/introduction>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:44.851Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/javascript-rendering>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.061Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/typescript-project>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.288Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/avoid-blocking>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.504Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/cheerio-crawler-guide>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.710Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/jsdom-crawler-guide>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:45.936Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/javascript-rendering>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.127Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/api/core/class/AutoscaledPool>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.372Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/proxy-management>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.618Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/result-storage>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:46.787Z [scrapy.spidermiddlewares.urllength] INFO Ignoring link (url length > 2083): https://console.apify.com/actors/6i5QsHBMtm3hKph70?runConfig=eyJ1IjoiRWdQdHczb2VqNlRhRHQ1cW4iLCJ2IjoxfQ.eyJpbnB1dCI6IntcbiAgICBcImNvZGVcIjogXCJpbXBvcnQgeyBQbGF5d3JpZ2h0Q3Jhd2xlciB9IGZyb20gJ2NyYXdsZWUnO1xcblxcbi8vIEltcG9ydCB0aGUgYEFjdG9yYCBjbGFzcyBmcm9tIHRoZSBBcGlmeSBTREsuXFxuaW1wb3J0IHsgQWN0b3IgfSBmcm9tICdhcGlmeSc7XFxuXFxuLy8gU2V0IHVwIHRoZSBpbnRlZ3JhdGlvbiB0byBBcGlmeS5cXG5hd2FpdCBBY3Rvci5pbml0KCk7XFxuXFxuLy8gQ3Jhd2xlciBzZXR1cCBmcm9tIHRoZSBwcmV2aW91cyBleGFtcGxlLlxcbmNvbnN0IGNyYXdsZXIgPSBuZXcgUGxheXdyaWdodENyYXdsZXIoe1xcbiAgICAvLyBVc2UgdGhlIHJlcXVlc3RIYW5kbGVyIHRvIHByb2Nlc3MgZWFjaCBvZiB0aGUgY3Jhd2xlZCBwYWdlcy5cXG4gICAgYXN5bmMgcmVxdWVzdEhhbmRsZXIoeyByZXF1ZXN0LCBwYWdlLCBlbnF1ZXVlTGlua3MsIHB1c2hEYXRhLCBsb2cgfSkge1xcbiAgICAgICAgY29uc3QgdGl0bGUgPSBhd2FpdCBwYWdlLnRpdGxlKCk7XFxuICAgICAgICBsb2cuaW5mbyhgVGl0bGUgb2YgJHtyZXF1ZXN0LmxvYWRlZFVybH0gaXMgJyR7dGl0bGV9J2ApO1xcblxcbiAgICAgICAgLy8gU2F2ZSByZXN1bHRzIGFz... [line-too-long]
2025-02-10T17:26:46.851Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides/request-storage>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.052Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/api/utils/namespace/social>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.287Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/api/utils>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.503Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/api/basic-crawler/interface/BasicCrawlerOptions>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.776Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/quick-start>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:47.970Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/deployment/aws-cheerio>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.029Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/deployment/gcp-cheerio>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.081Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/guides>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.114Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/examples>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.426Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/upgrading/upgrading-to-v3>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.500Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/blog>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:48.530Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/api/core>... ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.277Z [scrapy.core.engine] INFO Closing spider (finished) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.281Z [scrapy.statscollectors] INFO Dumping Scrapy stats:
2025-02-10T17:26:49.283Z {'downloader/request_bytes': 13234,
2025-02-10T17:26:49.286Z 'downloader/request_count': 49,
2025-02-10T17:26:49.288Z 'downloader/request_method_count/GET': 49,
2025-02-10T17:26:49.291Z 'downloader/response_bytes': 1384307,
2025-02-10T17:26:49.293Z 'downloader/response_count': 49,
2025-02-10T17:26:49.296Z 'downloader/response_status_count/200': 49,
2025-02-10T17:26:49.298Z 'elapsed_time_seconds': 12.935337,
2025-02-10T17:26:49.301Z 'finish_reason': 'finished',
2025-02-10T17:26:49.303Z 'finish_time': datetime.datetime(2025, 2, 10, 17, 26, 49, 277550, tzinfo=datetime.timezone.utc),
2025-02-10T17:26:49.306Z 'httpcompression/response_bytes': 8749039,
2025-02-10T17:26:49.308Z 'httpcompression/response_count': 49,
2025-02-10T17:26:49.310Z 'item_scraped_count': 48,
2025-02-10T17:26:49.313Z 'items_per_minute': None,
2025-02-10T17:26:49.316Z 'log_count/INFO': 58,
2025-02-10T17:26:49.318Z 'memusage/max': 105684992,
2025-02-10T17:26:49.320Z 'memusage/startup': 105684992,
2025-02-10T17:26:49.322Z 'offsite/domains': 10,
2025-02-10T17:26:49.324Z 'offsite/filtered': 20,
2025-02-10T17:26:49.327Z 'request_depth_max': 1,
2025-02-10T17:26:49.329Z 'response_received_count': 49,
2025-02-10T17:26:49.332Z 'responses_per_minute': None,
2025-02-10T17:26:49.335Z 'robotstxt/request_count': 1,
2025-02-10T17:26:49.338Z 'robotstxt/response_count': 1,
2025-02-10T17:26:49.340Z 'robotstxt/response_status_count/200': 1,
2025-02-10T17:26:49.343Z 'start_time': datetime.datetime(2025, 2, 10, 17, 26, 36, 342213, tzinfo=datetime.timezone.utc),
2025-02-10T17:26:49.345Z 'urllength/request_ignored_count': 1} ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.348Z [scrapy.core.engine] INFO Spider closed (finished) ({"spider": "<TitleSpider 'title_spider' at 0x7186703ef980>"})
2025-02-10T17:26:49.351Z [apify] INFO Exiting Actor ({"exit_code": 0})
2025-02-10T17:26:49.353Z [apify] INFO Exiting Actor ({"exit_code": 0, "message": "Exiting Actor"})
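To check the duplicate counts without reading the log by eye, the "is parsing" messages above can be tallied per URL. The following is a minimal sketch, not part of the Apify SDK or Scrapy: the helper names `count_parsed_urls` and `duplicates` are hypothetical, and it assumes the log message format stays exactly as printed by the template's `TitleSpider`. The sample text reuses the first four "is parsing" lines from the log, with the trailing spider context trimmed.

```python
import re
from collections import Counter

# Matches the spider's "TitleSpider is parsing <STATUS URL>" messages.
# Assumption: the message format is the one shown in the log above.
PARSED_RE = re.compile(r"TitleSpider is parsing <\d{3} ([^>\s]+)>")

# First four "is parsing" lines from the log above (trailing context trimmed).
SAMPLE_LOG = """\
2025-02-10T17:26:36.832Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/>...
2025-02-10T17:26:38.548Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/>...
2025-02-10T17:26:39.136Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/docs/examples>...
2025-02-10T17:26:39.909Z [title_spider] INFO TitleSpider is parsing <200 https://crawlee.dev/blog>...
"""


def count_parsed_urls(log_text: str) -> Counter:
    """Count how often each URL appears in an 'is parsing' log message."""
    return Counter(PARSED_RE.findall(log_text))


def duplicates(counts: Counter) -> dict:
    """Keep only the URLs that were parsed more than once."""
    return {url: n for url, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Print every URL that was parsed more than once, with its count.
    for url, n in sorted(duplicates(count_parsed_urls(SAMPLE_LOG)).items()):
        print(f"{n}x {url}")
```

Running the same functions over the full run log (for example, after saving it to a file and passing its contents to `count_parsed_urls`) lists every URL the spider parsed more than once, which makes it easy to compare runs before and after a fix.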