Commit 6391fb4

update documentation about ArticleBody and how to filter noisy sitemaps
1 parent 716b827 commit 6391fb4

1 file changed: +22 -10 lines changed

docs/how_to_add_a_publisher.md

Lines changed: 22 additions & 10 deletions
@@ -239,19 +239,27 @@ You can check if a sitemap is a news map by:
E.g. `<urlset ... xmlns:news="http://www.google.com/schemas/sitemap-news/0.9" ... >`<br>
**_NOTE:_** This can only be found within the actual sitemap and not the index map.
+#### Filter noisy sitemaps
+
+Sitemaps can include a lot of noise, such as maps that only point to collections of tags or authors.
+You can use the `sitemap_filter` parameter of `Sitemap` or `NewsMap` to prefilter these based on a regular expression.
+E.g.
+```` python
+Sitemap("https://apnews.com/sitemap.xml", sitemap_filter=regex_filter("apnews.com/hub/|apnews.com/video/"))
+````
+This filters out every URL encountered while processing the `Sitemap` object that contains either the string `apnews.com/hub/` or `apnews.com/video/`.
+Alternatively:
+````python
+sitemap_filter=inverse(regex_filter("sitemap-content-"))
+````
+will exclude all sitemap URLs not containing the substring `sitemap-content-`.
+
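To make the semantics of the two `sitemap_filter` examples concrete, here is a minimal sketch that uses only the standard library. The `regex_filter` and `inverse` functions below are hypothetical stand-ins written for illustration, not Fundus's actual implementation; the `apply_filter` helper and the `example.com` URLs are likewise assumptions. The sketch assumes that a filter returns `True` for a URL that should be dropped, which matches the behaviour described in the text.

```python
import re
from typing import Callable, List

# A filter returns True when the URL should be dropped (assumed convention).
URLFilter = Callable[[str], bool]


def regex_filter(pattern: str) -> URLFilter:
    """Hypothetical stand-in: drop every URL the pattern matches anywhere."""
    compiled = re.compile(pattern)
    return lambda url: bool(compiled.search(url))


def inverse(url_filter: URLFilter) -> URLFilter:
    """Hypothetical stand-in: flip a filter, so it drops what the original kept."""
    return lambda url: not url_filter(url)


def apply_filter(urls: List[str], url_filter: URLFilter) -> List[str]:
    """Illustrative helper: keep only the URLs the filter does not drop."""
    return [url for url in urls if not url_filter(url)]


# Made-up sitemap URLs, as they might appear in a noisy sitemap index.
sitemaps = [
    "https://example.com/sitemap-content-2024.xml",
    "https://example.com/sitemap-tags.xml",
    "https://example.com/sitemap-authors.xml",
]

# Drops URLs containing either substring.
drop_hub_and_video = regex_filter("apnews.com/hub/|apnews.com/video/")

# Drops every URL that does NOT contain "sitemap-content-".
keep_only_content = inverse(regex_filter("sitemap-content-"))

kept = apply_filter(sitemaps, keep_only_content)
print(kept)
```

With the inverted filter, only the `sitemap-content-` map survives, which is convenient when most entries of an index point to tag or author maps.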
### Finishing the Publisher Specification
-1. Sometimes sitemaps can include a lot of noise like maps pointing to a collection of tags or authors, etc.
-You can use the `sitemap_filter` parameter of `Sitemap` or `NewsMap` to prefilter these based on a regular expression.
-E.g.
-```` python
-Sitemap("https://apnews.com/sitemap.xml", sitemap_filter=regex_filter("apnews.com/hub/|apnews.com/video/"))
-````
-Will filter out all URLs encountered within the processing of the `Sitemap` object including either the string `apnews.com/hub/` or `apnews.com/video/`.
-2. If your publisher requires to use custom request headers to work properly you can alter it by using the `request_header` parameter of `PublisherSpec`.
+1. If your publisher requires custom request headers to work properly, you can set them using the `request_header` parameter of `PublisherSpec`.
The default is: `{"user_agent": "Fundus"}`.
-3. If you want to block URLs for the entire publisher use the `url_filter` parameter of `Publisher`.
-4. In some cases it can be necessary to append query parameters to the end of the URL, e.g. to load the article as one page. This can be achieved by adding the `query_parameter` attribute of `PublisherSpec` and assigning it a dictionary object containing the key - value pairs: e.g. `{"page": "all"}`. These key - value pairs will be appended to all crawled URLs.
+2. If you want to block URLs for the entire publisher, use the `url_filter` parameter of `Publisher`.
+3. In some cases it can be necessary to append query parameters to the end of the URL, e.g. to load the article as one page. This can be achieved by setting the `query_parameter` attribute of `PublisherSpec` to a dictionary of key-value pairs, e.g. `{"page": "all"}`. These key-value pairs will be appended to all crawled URLs.
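As a rough illustration of what appending `{"page": "all"}` to a crawled URL amounts to, the sketch below uses only `urllib.parse`. The helper name `append_query_parameters` and the `example.com` URLs are hypothetical, chosen for this example; Fundus may build its request URLs differently.

```python
from typing import Dict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def append_query_parameters(url: str, query_parameter: Dict[str, str]) -> str:
    """Hypothetical helper: append key-value pairs to a URL's query string."""
    parts = urlsplit(url)
    # Preserve any query parameters the URL already carries.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.extend(query_parameter.items())
    return urlunsplit(parts._replace(query=urlencode(query)))


# Made-up article URL; the dictionary mirrors the example in the text.
paged = append_query_parameters(
    "https://example.com/article/some-story", {"page": "all"}
)
print(paged)
```

Parsing and re-encoding the query string, instead of concatenating `?page=all` directly, keeps the result valid when the crawled URL already contains a query.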
Now, let's put it all together to specify The Intercept as a new publisher in Fundus:
@@ -554,6 +562,10 @@ To accurately extract the body of an article, use the `extract_article_body_with
This function accepts selectors for the different body parts as input and returns a parsed `ArticleBody`.
For practical examples, refer to existing parser implementations to understand how everything integrates.

+> [!IMPORTANT]
+> Regardless of the article's layout, the extracted `ArticleBody` should closely mirror the actual body/text of the article and must not include any additional content.
+> This ensures that the text can be accurately mapped back to the HTML for annotation purposes.
+
### Extracting the images

Fundus offers a utility function `image_extraction` to extract images from the article.
