---
id: adaptive-playwright-crawler
title: AdaptivePlaywrightCrawler
description: How to use the AdaptivePlaywrightCrawler.
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

import AdaptivePlaywrightCrawlerInitBeautifulSoup from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_beautifulsoup.py';
import AdaptivePlaywrightCrawlerInitParsel from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_parsel.py';
import AdaptivePlaywrightCrawlerInitPrediction from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_init_prediction.py';
import AdaptivePlaywrightCrawlerHandler from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_handler.py';
import AdaptivePlaywrightCrawlerPreNavHooks from '!!raw-loader!./code/adaptive_playwright_crawler/adaptive_playwright_crawler_pre_nav_hooks.py';

An <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is a combination of <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and some implementation of an HTTP-based crawler, such as <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> or <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.
It uses a more limited crawling context interface so that it can switch to HTTP-only crawling whenever it detects that this may bring a performance benefit.

Detection is based on the <ApiLink to="class/RenderingTypePredictor">`RenderingTypePredictor`</ApiLink>, whose default implementation is <ApiLink to="class/DefaultRenderingTypePredictor">`DefaultRenderingTypePredictor`</ApiLink>. It predicts which crawling method should be used and learns from already crawled pages.

## When to use AdaptivePlaywrightCrawler

Use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> in scenarios where some target pages have to be crawled with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, while for others a faster HTTP-based crawler is sufficient. This way, you can achieve lower costs when crawling multiple different websites.

Another use case is performing selector-based data extraction without prior knowledge of whether the selector exists in the static page or is added dynamically by code executed in the browser.

## Request handler and adaptive context helpers

The request handler for <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> works with a special context type - <ApiLink to="class/AdaptivePlaywrightCrawlingContext">`AdaptivePlaywrightCrawlingContext`</ApiLink>. This context is sometimes created by the HTTP-based sub crawler and sometimes by the Playwright-based sub crawler. Due to its dynamic nature, you can't always access the [page](https://playwright.dev/python/docs/api/class-page) object. To overcome this limitation, there are four helper methods on this context that can be called regardless of how the context was created.

<ApiLink to="class/AdaptivePlaywrightCrawlingContext#wait_for_selector">`wait_for_selector`</ApiLink> accepts a `css` selector as the first argument and a timeout as the second argument. The function will try to locate this selector and return once it is found (within the timeout). In practice this means that if the HTTP-based sub crawler was used, the function will find the selector only if it is part of the static content. If not, the adaptive crawler will fall back to the Playwright sub crawler and will try to locate the selector within the timeout using Playwright.

<ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_one">`query_selector_one`</ApiLink> accepts a `css` selector as the first argument and a timeout as the second argument. It acts similarly to `wait_for_selector`, but it also returns the first matching element, if one is found. The return value type is determined by the HTTP-based sub crawler used. For example, it will be `Selector` for <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> and `Tag` for <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.

<ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_all">`query_selector_all`</ApiLink> works the same as <ApiLink to="class/AdaptivePlaywrightCrawlingContext#query_selector_one">`query_selector_one`</ApiLink>, but returns all matching elements instead of just the first one.

<ApiLink to="class/AdaptivePlaywrightCrawlingContext#parse_with_static_parser">`parse_with_static_parser`</ApiLink> will re-parse the whole page. The return value type is determined by the HTTP-based sub crawler used. It has optional arguments: `selector` and `timeout`. If those optional arguments are used, the function first calls <ApiLink to="class/AdaptivePlaywrightCrawlingContext#wait_for_selector">`wait_for_selector`</ApiLink> and then does the parsing. This can be used in scenarios where a specific element signals that the page has finished loading.

See the following example of how to create a request handler and use the context helpers:

<CodeBlock className="language-python">
    {AdaptivePlaywrightCrawlerHandler}
</CodeBlock>
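
To make the helper behavior more concrete, here is a minimal, hypothetical sketch of a handler that combines the helpers described above. The URL and selectors are made up for illustration, the timeouts are assumed to be `datetime.timedelta` values, and the import paths may differ between library versions, so treat this as a sketch rather than a verified recipe:

```python
import asyncio
from datetime import timedelta

# Assumption: import paths follow the `crawlee.crawlers` layout; adjust for your version.
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext


async def main() -> None:
    crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser()

    @crawler.router.default_handler
    async def request_handler(context: AdaptivePlaywrightCrawlingContext) -> None:
        # Works regardless of which sub crawler created the context: either the selector
        # is found in the static content, or the crawler falls back to Playwright.
        await context.wait_for_selector('h1', timedelta(seconds=5))

        # With the Parsel-based crawler this returns a parsel `Selector`; with the
        # BeautifulSoup-based crawler it would be a `Tag`.
        heading = await context.query_selector_one('h1', timedelta(seconds=5))

        # Re-parse the whole page, waiting for the (hypothetical) '#content' element first.
        # The parsed result is not used further in this sketch.
        parsed_page = await context.parse_with_static_parser(
            selector='#content', timeout=timedelta(seconds=5)
        )

        await context.push_data({'url': context.request.url, 'heading': str(heading)})

    await crawler.run(['https://crawlee.dev/'])


if __name__ == '__main__':
    asyncio.run(main())
```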
## Crawler configuration

To use <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>, it is recommended to use one of the prepared factory methods that will create the crawler with a specific HTTP-based sub crawler variant: <ApiLink to="class/AdaptivePlaywrightCrawler#with_beautifulsoup_static_parser">`AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser`</ApiLink> or <ApiLink to="class/AdaptivePlaywrightCrawler#with_parsel_static_parser">`AdaptivePlaywrightCrawler.with_parsel_static_parser`</ApiLink>.

<ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> is internally composed of two sub crawlers, and you can configure both of them in detail. For detailed configuration options of the sub crawlers, please refer to their pages: <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink>, <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.

In the following example you can see how to create and configure an <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink> with two different HTTP-based sub crawlers:

<Tabs>
  <TabItem value="BeautifulSoupCrawler" label="BeautifulSoupCrawler" default>
    <CodeBlock className="language-python">
      {AdaptivePlaywrightCrawlerInitBeautifulSoup}
    </CodeBlock>
  </TabItem>
  <TabItem value="ParselCrawler" label="ParselCrawler">
    <CodeBlock className="language-python">
      {AdaptivePlaywrightCrawlerInitParsel}
    </CodeBlock>
  </TabItem>
</Tabs>

### Prediction related arguments

To control which pages are crawled by which method, you can use the following arguments:

`rendering_type_predictor` - An implementation of <ApiLink to="class/RenderingTypePredictor">`RenderingTypePredictor`</ApiLink>, a class that recommends which sub crawler should be used for a specific URL. From time to time, the predictor will also recommend using both sub crawlers for the same page, to verify that its recommendation was correct. The predictor should be able to learn from previous results and gradually give more reliable recommendations.

`result_checker` - A function that checks the result created by crawling a page. By default, it always returns `True`.

`result_comparator` - A function that compares two results (the HTTP-based sub crawler result and the Playwright-based sub crawler result) and returns `True` if they are considered the same. By default, this function compares the `push_data` calls made through the context helper by each sub crawler. It is used by the `rendering_type_predictor` to evaluate whether the HTTP-based sub crawler produces the same results as the Playwright-based sub crawler.

See the following example of how to pass prediction-related arguments:

<CodeBlock className="language-python">
    {AdaptivePlaywrightCrawlerInitPrediction}
</CodeBlock>
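
If you only need custom checking and comparison logic, a minimal sketch could look like the one below. The exact type of the result objects passed to these callbacks is defined by the library, and the attribute access in the comparator is an assumption made for illustration, so verify both against the API reference:

```python
from typing import Any

from crawlee.crawlers import AdaptivePlaywrightCrawler


def result_checker(result: Any) -> bool:
    # Decide whether a sub crawler result is acceptable at all.
    # The default implementation always returns True.
    return True


def result_comparator(result_1: Any, result_2: Any) -> bool:
    # Decide whether the HTTP-based and Playwright-based results are equivalent.
    # The default compares the push_data calls recorded by each sub crawler;
    # the attribute name below is an assumption for illustration only.
    return getattr(result_1, 'push_data_calls', None) == getattr(result_2, 'push_data_calls', None)


crawler = AdaptivePlaywrightCrawler.with_parsel_static_parser(
    result_checker=result_checker,
    result_comparator=result_comparator,
)
```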
## Page configuration with pre-navigation hooks

In some use cases, you may need to configure the [page](https://playwright.dev/python/docs/api/class-page) before it navigates to the target URL. For instance, you might set navigation timeouts or manipulate other page-level settings. For such cases you can use the <ApiLink to="class/AdaptivePlaywrightCrawler#pre_navigation_hook">`pre_navigation_hook`</ApiLink> method of the <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>. This method is called before the page navigates to the target URL and allows you to configure the page instance. Due to the dynamic nature of <ApiLink to="class/AdaptivePlaywrightCrawler">`AdaptivePlaywrightCrawler`</ApiLink>, the hook may be executed by either the HTTP-based sub crawler or the Playwright-based sub crawler. Using the [page](https://playwright.dev/python/docs/api/class-page) object in a hook that is executed by the HTTP-based sub crawler will raise an exception. To overcome this, you can pass the optional argument `playwright_only=True` when registering the hook.

See the following example of how to register pre-navigation hooks:

<CodeBlock className="language-python">
    {AdaptivePlaywrightCrawlerPreNavHooks}
</CodeBlock>
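
As a rough orientation, a registration sketch might look like the following. It assumes that `pre_navigation_hook` can be used as a decorator and accepts the `playwright_only` keyword, and the page call shown is just an illustrative Playwright API; check the imported example above and the API reference for the authoritative usage:

```python
from crawlee.crawlers import AdaptivePlaywrightCrawler

crawler = AdaptivePlaywrightCrawler.with_beautifulsoup_static_parser()


# Runs for both sub crawlers, so it must not touch the Playwright page.
@crawler.pre_navigation_hook
async def common_hook(context) -> None:
    context.log.info(f'Navigating to {context.request.url} ...')


# Registered with playwright_only=True, so it runs only when a browser page is available.
@crawler.pre_navigation_hook(playwright_only=True)
async def browser_hook(context) -> None:
    # Hypothetical page-level tweak using the Playwright Page API.
    await context.page.set_viewport_size({'width': 1280, 'height': 720})
```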