README.md (5 additions, 0 deletions)
@@ -25,6 +25,11 @@ Developed at <a href="https://www.informatik.hu-berlin.de/en/forschung-en/gebiet
---
+***Disclaimer**: Although we try to indicate whether a publisher has explicitly objected to the training of AI models on its data, this information must be verified independently before the publisher's content is used.
+More details can be found [here](docs/5_advanced_topics.md#filtering-publishers-for-ai-training).*
docs/5_advanced_topics.md (8 additions, 0 deletions)
@@ -4,6 +4,7 @@
*[How to search for publishers](#how-to-search-for-publishers)
*[Using `search()`](#using-search)
*[Working with deprecated publishers](#working-with-deprecated-publishers)
+*[Filtering publishers for AI training](#filtering-publishers-for-ai-training)
# Advanced Topics
@@ -33,4 +34,11 @@ When we notice that a publisher is uncrawlable for whatever reason, we will mark
This is mostly used internally, since the `ignore_deprecated` flag of the `Crawler` defaults to `False`.
You can alter this behaviour by setting the `ignore_deprecated` flag when initializing the `Crawler`.
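A minimal sketch of setting the flag at construction time might look like the following. The helper `make_crawler` and its `crawler_cls` parameter are hypothetical and exist only so the example can be exercised without Fundus installed; passing the publishers positionally is assumed from Fundus's general usage rather than stated in this section.

```python
from typing import Any, Optional


def make_crawler(publishers: Any, ignore_deprecated: bool = False, crawler_cls: Optional[Any] = None) -> Any:
    """Build a crawler with an explicit `ignore_deprecated` setting.

    `crawler_cls` is a hypothetical hook for substituting a stand-in class;
    when it is omitted, the Fundus `Crawler` is imported lazily.
    """
    if crawler_cls is None:
        from fundus import Crawler  # requires Fundus to be installed

        crawler_cls = Crawler
    # Assumption: publishers are passed positionally, the flag as a keyword.
    return crawler_cls(publishers, ignore_deprecated=ignore_deprecated)
```

With Fundus installed, a call such as `make_crawler(PublisherCollection.us, ignore_deprecated=True)` would construct the crawler with the flag set.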
+## Filtering publishers for AI training
+
+Some publishers explicitly disallow the use of their content for AI training purposes.
+We _try_ to respect these wishes through the `skip_publishers_disallowing_training` parameter of the `crawl()` function.
+Users who intend to use Fundus to gather training data for AI models should set this parameter to `True` to avoid collecting articles from publishers that do not want their content used in this way.
+However, because publishers are not required to state this in their robots.txt file, users should additionally check the terms of use of each publisher they want to crawl and set the `disallows_training` attribute of the `Publisher` class accordingly.
+
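As a rough sketch of how the parameter described above might be used, the helper below passes `skip_publishers_disallowing_training=True` to `crawl()`. The function names `collect_training_articles` and `crawl_us_publishers` are hypothetical, and `max_articles` and `PublisherCollection.us` are assumed from Fundus's general usage rather than taken from this change.

```python
from typing import Any, List


def collect_training_articles(crawler: Any, max_articles: int = 10) -> List[Any]:
    """Collect articles while skipping publishers that disallow AI training.

    `crawler` is expected to behave like a Fundus `Crawler`; the function
    name and the `max_articles` default are illustrative assumptions.
    """
    return list(
        crawler.crawl(
            max_articles=max_articles,
            skip_publishers_disallowing_training=True,
        )
    )


def crawl_us_publishers(max_articles: int = 5) -> List[Any]:
    """Hypothetical entry point; needs Fundus installed and network access."""
    from fundus import Crawler, PublisherCollection  # imported lazily on purpose

    crawler = Crawler(PublisherCollection.us)
    return collect_training_articles(crawler, max_articles=max_articles)
```

Setting the flag only covers publishers already known to object, so the terms of use of each publisher still need to be checked by hand.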
In the [next section](6_logging.md) we introduce you to Fundus logging mechanics.