<a id="crawler"></a>

# Module crawler

<a id="crawler.Crawler"></a>

## Crawler

```python
class Crawler(BaseComponent)
```
Crawl texts from a website so that you can use them later in Haystack, for example as a corpus for search or question answering.

**Example:**
```python
from haystack.nodes.connector import Crawler

crawler = Crawler(output_dir="crawled_files")
# crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/overview/
docs = crawler.crawl(
    urls=["https://haystack.deepset.ai/overview/get-started"],
    filter_urls=["haystack.deepset.ai/overview/"],
)
```

<a id="crawler.Crawler.__init__"></a>

#### Crawler.\_\_init\_\_

```python
def __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True, id_hash_keys: Optional[List[str]] = None, extract_hidden_text=True, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None)
```

Initialize the object with basic parameters for crawling (these can be overridden later when calling `crawl()`).

**Arguments**:

- `output_dir`: Path of the directory in which to store the crawled files.
- `urls`: List of http(s) address(es) (can also be supplied later when calling `crawl()`).
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  - 0: Only the initial list of URLs.
  - 1: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
  All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
  attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
  not unique, you can modify the metadata and pass, e.g., `"meta"` in this list (e.g. [`"content"`, `"meta"`]).
  In this case the ID is generated from the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text contained in the page,
  e.g. text inside a span with style="display: none".
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on
  dynamic DOM manipulation. Use carefully and only when needed, as it slows down crawling.
  E.g. 2: the Crawler waits 2 seconds before scraping the page.
- `crawler_naming_function`: A function mapping the crawled page to a file name.
  By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of this unprocessed page URL.
  E.g. 1) crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)
  generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
  2) crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()
  generates a file name from the MD5 hash of the concatenation of the URL and the page content.

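
To make these parameters concrete, here is a minimal initialization sketch. The depth, filter pattern, and naming function are illustrative choices for this example, not library defaults:

```python
import re

from haystack.nodes.connector import Crawler

# Illustrative values only; adjust them to the site you want to crawl.
crawler = Crawler(
    output_dir="crawled_files",              # JSON files are written here
    crawler_depth=1,                         # follow links found on the initial URLs, but no further
    filter_urls=[r"haystack\.deepset\.ai"],  # keep only URLs matching this regular expression
    overwrite_existing_files=True,
    # Build file names by replacing characters not allowed in file names with underscores:
    crawler_naming_function=lambda url, page_content: re.sub(r"[<>:'/\\|?*\0 ]", "_", url),
)
```

Passing a `crawler_naming_function` that ignores `page_content`, as above, keeps file names stable across re-crawls of the same URL.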

<a id="crawler.Crawler.crawl"></a>

#### Crawler.crawl

```python
def crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = None, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None) -> List[Path]
```

Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it, and save it (one JSON file per URL, including text and basic meta data).
You can optionally use `filter_urls` to crawl only URLs that match a certain pattern.
All parameters are optional here and are only meant to override the instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during `__init__` are used.

**Arguments**:

- `output_dir`: Path of the directory in which to store the crawled files.
- `urls`: List of http addresses or a single http address.
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  - 0: Only the initial list of URLs.
  - 1: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
  All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
  attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
  not unique, you can modify the metadata and pass, e.g., `"meta"` in this list (e.g. [`"content"`, `"meta"`]).
  In this case the ID is generated from the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text contained in the page,
  e.g. text inside a span with style="display: none".
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on
  dynamic DOM manipulation. Use carefully and only when needed, as it slows down crawling.
  E.g. 2: the Crawler waits 2 seconds before scraping the page.
- `crawler_naming_function`: A function mapping the crawled page to a file name.
  By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of this unprocessed page URL.
  E.g. 1) crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)
  generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
  2) crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()
  generates a file name from the MD5 hash of the concatenation of the URL and the page content.

**Returns**:

List of paths to the files where the crawled webpages were stored.

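
A brief usage sketch: the paths returned by `crawl()` point to JSON files, one per page, which can be inspected with the standard library. The field name `content` is an assumption based on the `id_hash_keys` description above, not something this signature guarantees:

```python
import json

from haystack.nodes.connector import Crawler

crawler = Crawler(output_dir="crawled_files")

# Returns the paths of the JSON files written for the crawled pages.
file_paths = crawler.crawl(
    urls=["https://haystack.deepset.ai/overview/get-started"],
    filter_urls=["haystack.deepset.ai/overview/"],
    crawler_depth=0,  # only the initial URL, no sublinks
)

for path in file_paths:
    with open(path, "r", encoding="utf-8") as f:
        page = json.load(f)
    # "content" is assumed to be the text field, as referenced by `id_hash_keys` above.
    print(path, len(page.get("content", "")))
```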

<a id="crawler.Crawler.run"></a>

#### Crawler.run

```python
def run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = True, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None) -> Tuple[Dict[str, Union[List[Document], List[Path]]], str]
```

Method to be executed when the Crawler is used as a Node within a Haystack pipeline.

**Arguments**:

- `output_dir`: Path of the directory in which to store the crawled files.
- `urls`: List of http addresses or a single http address.
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  - 0: Only the initial list of URLs.
  - 1: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
  All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `return_documents`: Whether to return the content of the crawled pages as `Document` objects instead of file paths.
- `id_hash_keys`: Generate the document ID from a custom list of strings that refer to the document's
  attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
  not unique, you can modify the metadata and pass, e.g., `"meta"` in this list (e.g. [`"content"`, `"meta"`]).
  In this case the ID is generated from the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text contained in the page,
  e.g. text inside a span with style="display: none".
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on
  dynamic DOM manipulation. Use carefully and only when needed, as it slows down crawling.
  E.g. 2: the Crawler waits 2 seconds before scraping the page.
- `crawler_naming_function`: A function mapping the crawled page to a file name.
  By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of this unprocessed page URL.
  E.g. 1) crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)
  generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
  2) crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()
  generates a file name from the MD5 hash of the concatenation of the URL and the page content.

**Returns**:

A tuple of the output dictionary (e.g. `{"paths": List of file paths, ...}`) and the name of the output edge.

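
The node is normally executed by a pipeline, but calling `run()` directly is a quick way to see the shape of the return value. A minimal sketch, using only the parameters documented above:

```python
from haystack.nodes.connector import Crawler

crawler = Crawler(output_dir="crawled_files")

# run() returns the output dictionary and the name of the output edge,
# exactly as a pipeline would receive them.
output, edge_name = crawler.run(
    urls=["https://haystack.deepset.ai/overview/get-started"],
    crawler_depth=0,  # only the initial URL, no sublinks
)

print(edge_name)        # name of the output edge
print(output["paths"])  # list of file paths, as described above
```

With `return_documents=True`, the output dictionary carries `Document` objects instead of file paths, as the return type annotation indicates.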