Crawlee provides <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> and <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> as built-in solutions for HTML parsing. However, you may want to use a different parsing library that better fits your specific needs.
There are two approaches to integrating a custom parser: using <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> directly with your preferred library, or building a custom crawler class around your parser.
The <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> gives you direct access to raw HTTP responses, allowing you to integrate any parsing library of your choice. This approach requires minimal setup, but helpers like <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> and <ApiLink to="class/ExtractLinksFunction">`extract_links`</ApiLink> are not available, so you discover and enqueue links yourself.
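A minimal sketch of this approach, reading the raw body inside a request handler (the target URL is illustrative):

```python
import asyncio

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def request_handler(context: HttpCrawlingContext) -> None:
        # The raw response body is available for any parser you choose.
        body = await context.http_response.read()
        context.log.info(f'Fetched {len(body)} bytes from {context.request.url}')

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```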
The following sections demonstrate how to use various parsing libraries with <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink> to extract data from a page and enqueue discovered links for further crawling.
### lxml
[lxml](https://lxml.de/) is a high-performance XML and HTML parser that provides Python bindings to the C libraries libxml2 and libxslt. It supports XPath 1.0, XSLT 1.0, and EXSLT extensions for element selection. The `make_links_absolute` method is particularly useful for converting relative URLs to absolute ones before link extraction.
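A sketch of a request handler that parses the body with lxml; the XPath expressions and the link filter are illustrative only:

```python
from lxml import html as lxml_html

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

crawler = HttpCrawler()


@crawler.router.default_handler
async def request_handler(context: HttpCrawlingContext) -> None:
    # Parse the raw body into an lxml element tree.
    tree = lxml_html.fromstring(await context.http_response.read())
    # Resolve relative URLs against the request URL before extracting links.
    tree.make_links_absolute(context.request.url)

    # Select data with XPath 1.0 and store it in the dataset.
    await context.push_data({
        'url': context.request.url,
        'title': tree.xpath('string(//title)'),
    })

    # `enqueue_links` is unavailable here, so enqueue extracted URLs manually.
    # The same-domain filter is just an example policy.
    links = [url for url in tree.xpath('//a/@href') if url.startswith('https://crawlee.dev')]
    await context.add_requests(links)
```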
### lxml with SaxonC-HE

Using [SaxonC-HE](https://pypi.org/project/saxonche/) together with lxml enables XPath 3.1 support, which provides advanced features like the `distinct-values()` function and more powerful string manipulation. In this setup, lxml converts HTML to well-formed XML that SaxonC-HE can process.
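A standalone helper sketching this handoff; the XPath 3.1 query is illustrative, and the `saxonche` calls follow its documented `PySaxonProcessor` API (verify against the version you install):

```python
from lxml import etree
from lxml import html as lxml_html
from saxonche import PySaxonProcessor


def extract_unique_links(body: bytes, base_url: str) -> list[str]:
    # lxml tolerantly parses the HTML and serializes it as well-formed XML.
    tree = lxml_html.fromstring(body)
    tree.make_links_absolute(base_url)
    xml = etree.tostring(tree, encoding='unicode', method='xml')

    with PySaxonProcessor(license=False) as processor:
        xpath_processor = processor.new_xpath_processor()
        xpath_processor.set_context(xdm_item=processor.parse_xml(xml_text=xml))
        # XPath 3.1 `distinct-values()` deduplicates the links in one expression.
        result = xpath_processor.evaluate('distinct-values(//a/@href)')
        if result is None:
            return []
        return [result.item_at(i).get_string_value() for i in range(result.size)]
```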
### selectolax

[selectolax](https://github.com/rushter/selectolax) is a fast HTML parser that offers two backends: the default `Modest` engine and `Lexbor`. It provides a simple API with CSS selector support. The example below uses the `Lexbor` backend for optimal performance.
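A sketch using the `Lexbor` backend; the selectors are illustrative:

```python
from urllib.parse import urljoin

from selectolax.lexbor import LexborHTMLParser

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

crawler = HttpCrawler()


@crawler.router.default_handler
async def request_handler(context: HttpCrawlingContext) -> None:
    # Parse with the Lexbor engine for best performance.
    parser = LexborHTMLParser(await context.http_response.read())

    title = parser.css_first('title')
    await context.push_data({
        'url': context.request.url,
        'title': title.text() if title else None,
    })

    # selectolax does not resolve relative URLs, so join them manually.
    links = [
        urljoin(context.request.url, anchor.attributes['href'])
        for anchor in parser.css('a[href]')
        if anchor.attributes.get('href')
    ]
    await context.add_requests(links)
```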
### PyQuery

[PyQuery](https://pyquery.readthedocs.io/) brings jQuery-like syntax to Python for HTML manipulation. Built on top of `lxml`, it combines familiar jQuery CSS selectors with Python's ease of use. This is a good choice if you're comfortable with jQuery syntax and want a straightforward API for DOM traversal and manipulation.
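A sketch of the same handler pattern with PyQuery; selectors are illustrative:

```python
from pyquery import PyQuery

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

crawler = HttpCrawler()


@crawler.router.default_handler
async def request_handler(context: HttpCrawlingContext) -> None:
    doc = PyQuery(await context.http_response.read())
    # PyQuery inherits lxml's ability to rewrite relative URLs in place.
    doc.make_links_absolute(base_url=context.request.url)

    await context.push_data({
        'url': context.request.url,
        'title': doc('title').text(),
    })

    # jQuery-style traversal: iterate matched anchors as PyQuery objects.
    await context.add_requests([a.attr('href') for a in doc('a[href]').items()])
```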
### Scrapling

[Scrapling](https://github.com/D4Vinci/Scrapling) is a scraping library that provides both CSS selectors and XPath 1.0. It offers automatic text extraction and a Scrapy/BeautifulSoup-like API with pseudo-element support similar to Parsel.
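A sketch assuming Scrapling's `Adaptor` parser class and its Parsel-like pseudo-element syntax; the class name and exact API may differ between Scrapling releases, so treat this as an outline rather than a verified integration:

```python
from scrapling import Adaptor

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

crawler = HttpCrawler()


@crawler.router.default_handler
async def request_handler(context: HttpCrawlingContext) -> None:
    body = await context.http_response.read()
    page = Adaptor(body.decode(), url=context.request.url)

    await context.push_data({
        'url': context.request.url,
        # Parsel-style pseudo-element for automatic text extraction.
        'title': page.css_first('title::text'),
    })

    # Keep only absolute links, since relative URLs are not resolved here.
    links = [link for link in page.css('a::attr(href)') if link.startswith('http')]
    await context.add_requests(links)
```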
The crawler class connects the parser and context. Extend <ApiLink to="class/AbstractHttpCrawler">`AbstractHttpCrawler`</ApiLink> to create a crawler class that ties your custom parser and context together.
### Using the crawler
The custom crawler works like any built-in crawler. Request handlers receive your custom context with full access to framework helpers like <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink>. Additionally, the custom parser can be used with <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> for adaptive crawling:
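A usage sketch; `SelectolaxCrawler` and `SelectolaxCrawlingContext` are hypothetical names standing in for whatever crawler and context classes you built in the previous steps:

```python
import asyncio

# Hypothetical classes created by extending AbstractHttpCrawler and
# its crawling context in the previous steps.
from my_selectolax_crawler import SelectolaxCrawler, SelectolaxCrawlingContext


async def main() -> None:
    crawler = SelectolaxCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: SelectolaxCrawlingContext) -> None:
        # The custom context exposes the parsed document...
        title = context.parsed_content.css_first('title')
        await context.push_data({
            'url': context.request.url,
            'title': title.text() if title else None,
        })
        # ...and framework helpers such as `enqueue_links`.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```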