
Commit 4df835a

Merge #356
356: Try to fix Scrapy ERROR: Spider error processing r=alallema a=alallema

Currently, when running docs-scraper, this error occurs:

```sh
2023-03-14 12:10:56 [scrapy.core.scraper] ERROR: Spider error processing <GET https://docs.meilisearch.com/learn/advanced/geosearch.html> (referer: https://docs.meilisearch.com/sitemap.xml)
Traceback (most recent call last):
  File "/Users/amelielallemand/.local/share/virtualenvs/docs-scraper-vWaWSN46/lib/python3.9/site-packages/twisted/internet/defer.py", line 892, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "/Users/amelielallemand/Projects/meili/repo/docs-scraper/scraper/src/documentation_spider.py", line 170, in parse_from_sitemap
    self.add_records(response, from_sitemap=True)
  File "/Users/amelielallemand/Projects/meili/repo/docs-scraper/scraper/src/documentation_spider.py", line 151, in add_records
    records = self.strategy.get_records_from_response(response)
  File "/Users/amelielallemand/Projects/meili/repo/docs-scraper/scraper/src/strategies/default_strategy.py", line 44, in get_records_from_response
    records = self.get_records_from_dom(response.url)
  File "/Users/amelielallemand/Projects/meili/repo/docs-scraper/scraper/src/strategies/default_strategy.py", line 67, in get_records_from_dom
    sys.exit('DefaultStrategy.dom is not defined')
SystemExit: DefaultStrategy.dom is not defined
```

This PR tries to fix it.

Co-authored-by: alallema <[email protected]>
2 parents d36dd0e + 4c9514d commit 4df835a

1 file changed: +1 -2 lines changed

scraper/src/strategies/abstract_strategy.py

Lines changed: 1 addition & 2 deletions
@@ -39,9 +39,8 @@ def get_dom(response):
         try:
             body = response.body.decode(response.encoding)
             result = lxml.html.fromstring(body)
-        except (UnicodeError, ValueError):
+        except (UnicodeError, ValueError, lxml.etree.ParserError):
             result = lxml.html.fromstring(response.body)
-
         return result
 
     def get_strip_chars(self, level, selectors):
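
The effect of widening the `except` tuple can be sketched without lxml or Scrapy. Everything below is a hypothetical stand-in, not the project's code: `ParserError` plays the role of `lxml.etree.ParserError`, `stub_fromstring` plays the role of `lxml.html.fromstring`, and `FakeResponse` carries only the two attributes `get_dom` reads. The stub is assumed to reject decoded text and accept raw bytes; the commit does not say which real pages behave this way, so treat that as an illustration of the control flow only.

```python
class ParserError(Exception):
    """Hypothetical stand-in for lxml.etree.ParserError."""


def stub_fromstring(markup):
    """Hypothetical stand-in for lxml.html.fromstring.

    Assumption for illustration: decoded text (str) is rejected,
    raw bytes are accepted.
    """
    if isinstance(markup, str):
        raise ParserError("Document is empty")
    return "<dom from %d raw bytes>" % len(markup)


class FakeResponse:
    """Minimal stand-in for a Scrapy response."""
    body = b"<html><body>geosearch</body></html>"
    encoding = "utf-8"


def get_dom(response, recoverable):
    """Same shape as the patched get_dom, with the except tuple as a parameter."""
    try:
        body = response.body.decode(response.encoding)
        result = stub_fromstring(body)
    except recoverable:
        # Fall back to letting the parser handle the raw bytes itself.
        result = stub_fromstring(response.body)
    return result


def try_get_dom(recoverable):
    """Run get_dom and report whether the parser error escaped."""
    try:
        return get_dom(FakeResponse(), recoverable)
    except ParserError as exc:
        return "unhandled: %s" % exc


# Old tuple: the parser error propagates out of get_dom.
before = try_get_dom((UnicodeError, ValueError))
# New tuple: the parser error triggers the raw-bytes fallback instead.
after = try_get_dom((UnicodeError, ValueError, ParserError))
print(before)
print(after)
```

With the old tuple the exception escapes `get_dom`; with the new one the fallback branch runs. If the raw-bytes parse also fails, the error still propagates, which is consistent with the PR describing itself as an attempt at a fix.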
