
Commit baefd32

Upgrade to v1.7.0 and copy docs folder (#3014)
* update version to 1.7.0
* copy docs
* update openapi
* generate schemas
* make update_json_schema() idempotent
* update docs, schema and openapi
1 parent d617553 commit baefd32

99 files changed: 36356 additions & 13 deletions


VERSION.txt

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-1.6.1rc0
+1.7.0

docs/_src/api/openapi/openapi-1.7.0.json

Lines changed: 886 additions & 0 deletions
Large diffs are not rendered by default.

docs/_src/api/openapi/openapi.json

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
   "openapi": "3.0.2",
   "info": {
     "title": "Haystack REST API",
-    "version": "1.6.1rc0"
+    "version": "1.7.0"
   },
   "paths": {
     "/initialized": {

docs/v1.7.0/Makefile

Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.

SPHINXBUILD := sphinx-build
MAKEINFO := makeinfo

BUILDDIR := build
SOURCE := _src/
# SPHINXFLAGS := -a -W -n -A local=1 -d $(BUILDDIR)/doctree
SPHINXFLAGS := -A local=1 -d $(BUILDDIR)/doctree
SPHINXOPTS := $(SPHINXFLAGS) $(SOURCE)

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	$(SPHINXBUILD) -M $@ $(SPHINXOPTS) $(BUILDDIR)/$@

docs/v1.7.0/_src/api/Makefile

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
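The catch-all target in this Makefile simply shells out to `sphinx-build` in "make mode". For orientation only (editorial commentary, not part of the commit), a rough Python equivalent of running `make html` in this directory, assuming Sphinx is installed; the `sphinx-build` CLI itself dispatches to `sphinx.cmd.build.main`:

```python
# Sketch only: roughly what "make html" delegates to in this directory.
# Equivalent CLI call: sphinx-build -M html . _build
from sphinx.cmd.build import main

# "-M" enables make mode; "." is SOURCEDIR and "_build" is BUILDDIR,
# matching the variables defined in the Makefile above.
exit_code = main(["-M", "html", ".", "_build"])
print(exit_code)  # 0 on a successful build
```
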
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
div.sphinxsidebarwrapper {
    position: relative;
    top: 0px;
    padding: 0;
}

div.sphinxsidebar {
    margin: 0;
    padding: 0 15px 0 15px;
    width: 210px;
    float: left;
    font-size: 1em;
    text-align: left;
}

div.sphinxsidebar .logo {
    font-size: 1.8em;
    color: #0A507A;
    font-weight: 300;
    text-align: center;
}

div.sphinxsidebar .logo img {
    vertical-align: middle;
}

div.sphinxsidebar .download a img {
    vertical-align: middle;
}
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
{# put the sidebar before the body #}
{% block sidebar1 %}{{ sidebar() }}{% endblock %}
{% block sidebar2 %}{% endblock %}

{% block extrahead %}
    <link href='https://fonts.googleapis.com/css?family=Open+Sans:300,400,700'
          rel='stylesheet' type='text/css' />
    {{ super() }}
    {#- if not embedded #}
    <style type="text/css">
      table.right { float: left; margin-left: 20px; }
      table.right td { border: 1px solid #ccc; }
      {% if pagename == 'index' %}
      .related { display: none; }
      {% endif %}
    </style>
    <script>
      // intelligent scrolling of the sidebar content
      $(window).scroll(function() {
        var sb = $('.sphinxsidebarwrapper');
        var win = $(window);
        var sbh = sb.height();
        var offset = $('.sphinxsidebar').position()['top'];
        var wintop = win.scrollTop();
        var winbot = wintop + win.innerHeight();
        var curtop = sb.position()['top'];
        var curbot = curtop + sbh;
        // does sidebar fit in window?
        if (sbh < win.innerHeight()) {
          // yes: easy case -- always keep at the top
          sb.css('top', $u.min([$u.max([0, wintop - offset - 10]),
                                $(document).height() - sbh - 200]));
        } else {
          // no: only scroll if top/bottom edge of sidebar is at
          // top/bottom edge of window
          if (curtop > wintop && curbot > winbot) {
            sb.css('top', $u.max([wintop - offset - 10, 0]));
          } else if (curtop < wintop && curbot < winbot) {
            sb.css('top', $u.min([winbot - sbh - offset - 20,
                                  $(document).height() - sbh - 200]));
          }
        }
      });
    </script>
    {#- endif #}
{% endblock %}
Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
<a id="crawler"></a>

# Module crawler

<a id="crawler.Crawler"></a>

## Crawler

```python
class Crawler(BaseComponent)
```

Crawl texts from a website so that we can use them later in Haystack as a corpus for search / question answering etc.

**Example:**
```python
from haystack.nodes.connector import Crawler

crawler = Crawler(output_dir="crawled_files")
# crawl Haystack docs, i.e. all pages that include haystack.deepset.ai/overview/
docs = crawler.crawl(urls=["https://haystack.deepset.ai/overview/get-started"],
                     filter_urls=["haystack.deepset.ai/overview/"])
```

<a id="crawler.Crawler.__init__"></a>

#### Crawler.\_\_init\_\_

```python
def __init__(output_dir: str, urls: Optional[List[str]] = None, crawler_depth: int = 1, filter_urls: Optional[List] = None, overwrite_existing_files=True, id_hash_keys: Optional[List[str]] = None, extract_hidden_text=True, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None)
```

Init object with basic params for crawling (can be overwritten later).

**Arguments**:

- `output_dir`: Path of the directory in which to store the crawled files.
- `urls`: List of http(s) addresses (can also be supplied later when calling crawl()).
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  0: Only crawl the initial list of URLs.
  1: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with. All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass, for example, `"meta"` to this field (e.g. [`"content"`, `"meta"`]). In this case the id is generated from the content and the defined metadata.
- `extract_hidden_text`: Whether to extract hidden text from the page, e.g. text inside a span with style="display: none".
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on dynamic DOM manipulation. Use carefully and only when needed, as it slows down crawling. E.g. 2: the Crawler waits 2 seconds before scraping the page.
- `crawler_naming_function`: A function mapping the crawled page to a file name.
  By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of the unprocessed page URL.
  E.g. 1) `crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)` generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
  2) `crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()` generates a file name from the MD5 hash of the concatenation of the URL and the page content.
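The `crawler_naming_function` contract described above, a callable of type `Callable[[str, str], str]`, is easier to read as a named function than as a lambda. A minimal sketch follows; it is illustrative commentary rather than part of the committed docs, and `url_slug_with_hash` is a hypothetical helper modelled on the default naming scheme:

```python
import hashlib
import re

from haystack.nodes.connector import Crawler

def url_slug_with_hash(url: str, page_content: str) -> str:
    # Replace characters that are not allowed in file names (same pattern as
    # the documented lambda example), then append a short hash of the raw URL
    # so that two pages with the same slug do not overwrite each other.
    slug = re.sub("[<>:'/\\|?*\0 ]", "_", url)
    return f"{slug}_{hashlib.md5(url.encode('utf-8')).hexdigest()[:6]}"

crawler = Crawler(
    output_dir="crawled_files",
    crawler_naming_function=url_slug_with_hash,  # called as (url, page_content) -> file name
)
```
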
<a id="crawler.Crawler.crawl"></a>

#### Crawler.crawl

```python
def crawl(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = None, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None) -> List[Path]
```

Crawl URL(s), extract the text from the HTML, create a Haystack Document object out of it and save it (one JSON file per URL, including text and basic metadata).
You can optionally use `filter_urls` to only crawl URLs that match a certain pattern.
All parameters are optional here and only meant to override the instance attributes at runtime.
If no parameters are provided to this method, the instance attributes that were passed during `__init__` are used.

**Arguments**:

- `output_dir`: Path of the directory in which to store the crawled files.
- `urls`: List of http addresses or a single http address.
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
  0: Only crawl the initial list of URLs.
  1: Follow links found on the initial URLs (but no further).
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with. All URLs not matching at least one of the regular expressions will be dropped.
- `overwrite_existing_files`: Whether to overwrite existing files in `output_dir` with new content.
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are not unique, you can modify the metadata and pass, for example, `"meta"` to this field (e.g. [`"content"`, `"meta"`]). In this case the id is generated from the content and the defined metadata.
- `loading_wait_time`: Seconds to wait for the page to load before scraping. Recommended when a page relies on dynamic DOM manipulation. Use carefully and only when needed, as it slows down crawling. E.g. 2: the Crawler waits 2 seconds before scraping the page.
- `crawler_naming_function`: A function mapping the crawled page to a file name.
  By default, the file name is generated from the processed page URL (a string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of the unprocessed page URL.
  E.g. 1) `crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", url)` generates a file name from the URL by replacing all characters that are not allowed in file names with underscores.
  2) `crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()` generates a file name from the MD5 hash of the concatenation of the URL and the page content.

**Returns**:

List of paths where the crawled webpages got stored.
105+
106+
#### Crawler.run
107+
108+
```python
109+
def run(output_dir: Union[str, Path, None] = None, urls: Optional[List[str]] = None, crawler_depth: Optional[int] = None, filter_urls: Optional[List] = None, overwrite_existing_files: Optional[bool] = None, return_documents: Optional[bool] = False, id_hash_keys: Optional[List[str]] = None, extract_hidden_text: Optional[bool] = True, loading_wait_time: Optional[int] = None, crawler_naming_function: Optional[Callable[[str, str], str]] = None) -> Tuple[Dict[str, Union[List[Document], List[Path]]], str]
110+
```
111+
112+
Method to be executed when the Crawler is used as a Node within a Haystack pipeline.
113+
114+
**Arguments**:
115+
116+
- `output_dir`: Path for the directory to store files
117+
- `urls`: List of http addresses or single http address
118+
- `crawler_depth`: How many sublinks to follow from the initial list of URLs. Current options:
119+
0: Only initial list of urls
120+
1: Follow links found on the initial URLs (but no further)
121+
- `filter_urls`: Optional list of regular expressions that the crawled URLs must comply with.
122+
All URLs not matching at least one of the regular expressions will be dropped.
123+
- `overwrite_existing_files`: Whether to overwrite existing files in output_dir with new content
124+
- `return_documents`: Return json files content
125+
- `id_hash_keys`: Generate the document id from a custom list of strings that refer to the document's
126+
attributes. If you want to ensure you don't have duplicate documents in your DocumentStore but texts are
127+
not unique, you can modify the metadata and pass e.g. `"meta"` to this field (e.g. [`"content"`, `"meta"`]).
128+
In this case the id will be generated by using the content and the defined metadata.
129+
- `extract_hidden_text`: Whether to extract the hidden text contained in page.
130+
E.g. the text can be inside a span with style="display: none"
131+
- `loading_wait_time`: Seconds to wait for page loading before scraping. Recommended when page relies on
132+
dynamic DOM manipulations. Use carefully and only when needed. Crawler will have scraping speed impacted.
133+
E.g. 2: Crawler will wait 2 seconds before scraping page
134+
- `crawler_naming_function`: A function mapping the crawled page to a file name.
135+
By default, the file name is generated from the processed page url (string compatible with Mac, Unix and Windows paths) and the last 6 digits of the MD5 sum of this unprocessed page url.
136+
E.g. 1) crawler_naming_function=lambda url, page_content: re.sub("[<>:'/\\|?*\0 ]", "_", link)
137+
This example will generate a file name from the url by replacing all characters that are not allowed in file names with underscores.
138+
2) crawler_naming_function=lambda url, page_content: hashlib.md5(f"{url}{page_content}".encode("utf-8")).hexdigest()
139+
This example will generate a file name from the url and the page content by using the MD5 hash of the concatenation of the url and the page content.
140+
141+
**Returns**:
142+
143+
Tuple({"paths": List of filepaths, ...}, Name of output edge)
144+
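To make the return contract of `run()` concrete, here is a minimal sketch of calling it directly; inside a pipeline the method is invoked for you, and this snippet is illustrative commentary rather than part of the committed docs:

```python
from haystack.nodes.connector import Crawler

crawler = Crawler(output_dir="crawled_files")

# run() returns a results dict plus the name of the outgoing pipeline edge.
results, output_edge = crawler.run(
    urls=["https://haystack.deepset.ai/overview/get-started"],
    return_documents=False,
)

print(output_edge)       # name of the output edge chosen by the node
print(results["paths"])  # paths of the stored JSON files (return_documents=False)
```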
