 Will filter out all URLs encountered within the processing of the `Sitemap` object including either the string `apnews.com/hub/` or `apnews.com/video/`.
-2. If your publisher requires to use custom request headers to work properly you can alter it by using the `request_header` parameter of `PublisherSpec`.
+1. If your publisher requires to use custom request headers to work properly you can alter it by using the `request_header` parameter of `PublisherSpec`.
 The default is: `{"user_agent": "Fundus"}`.
-3. If you want to block URLs for the entire publisher use the `url_filter` parameter of `Publisher`.
-4. In some cases it can be necessary to append query parameters to the end of the URL, e.g. to load the article as one page. This can be achieved by adding the `query_parameter` attribute of `PublisherSpec` and assigning it a dictionary object containing the key - value pairs: e.g. `{"page": "all"}`. These key - value pairs will be appended to all crawled URLs.
+2. If you want to block URLs for the entire publisher use the `url_filter` parameter of `Publisher`.
+3. In some cases it can be necessary to append query parameters to the end of the URL, e.g. to load the article as one page. This can be achieved by adding the `query_parameter` attribute of `PublisherSpec` and assigning it a dictionary object containing the key - value pairs: e.g. `{"page": "all"}`. These key - value pairs will be appended to all crawled URLs.
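The last two list items describe behaviour that can be illustrated without Fundus itself. The following is only a minimal standard-library sketch of what "blocking URLs by substring" and "appending query parameters to a URL" might look like; it is not Fundus code. The helper names `blocks_url` and `append_query_parameters` are hypothetical, and the choice to overwrite an already-present key is an assumption, not documented Fundus behaviour.

```python
from typing import Dict, Iterable
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def blocks_url(url: str, blocked_substrings: Iterable[str]) -> bool:
    """Hypothetical helper: True if the URL contains any blocked substring."""
    return any(substring in url for substring in blocked_substrings)


def append_query_parameters(url: str, params: Dict[str, str]) -> str:
    """Hypothetical helper: return the URL with `params` merged into its query.

    Assumption: a key that already exists in the URL is overwritten.
    """
    parts = urlsplit(url)
    # Keep existing parameters, then add or overwrite with the new ones.
    query = dict(parse_qsl(parts.query, keep_blank_values=True))
    query.update(params)
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    blocked = ("apnews.com/hub/", "apnews.com/video/")
    print(blocks_url("https://apnews.com/hub/politics", blocked))
    print(append_query_parameters("https://example.com/article", {"page": "all"}))
```

The example URLs on `example.com` are placeholders; in Fundus the equivalent behaviour is configured declaratively through the parameters named in the list above.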
 
 Now, let's put it all together to specify The Intercept as a new publisher in Fundus:
 
@@ -554,6 +562,10 @@ To accurately extract the body of an article, use the `extract_article_body_with
 This function accepts selectors for the different body parts as input and returns a parsed `ArticleBody`.
 For practical examples, refer to existing parser implementations to understand how everything integrates.
 
+> [!IMPORTANT]
+> Regardless of the article's layout, the extracted `ArticleBody` should closely mirror the actual body/text of the article and must not include any additional content.
+> This ensures that the text can be accurately mapped back to the HTML for annotation purposes.
+
 ### Extracting the images
 
 Fundus offers a utility function `image_extraction` to extract images from the article.