Commit b139777

Merge branch 'master' into add-swp-publisher
2 parents b3896bd + 716b827

554 files changed (+55213 / -1714 lines)

.github/workflows/publish-package.yml

Lines changed: 27 additions & 3 deletions
@@ -7,9 +7,28 @@ on:
   release:
     types:
       - released
+  workflow_dispatch:
 
 jobs:
 
+  permission-check:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Guard `workflow_dispatch`
+        if: github.event_name == 'workflow_dispatch'
+        id: check-admin
+        run: |
+          RESPONSE=$(curl -s -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \
+            https://api.github.com/repos/${{ github.repository }}/collaborators/${{ github.actor }}/permission)
+
+          PERMISSION=$(echo "$RESPONSE" | jq -r '.permission')
+
+          if [[ "$PERMISSION" != "admin" ]]; then
+            echo "User ${{ github.actor }} does not have admin rights."
+            exit 1
+          fi
+
   test:
     name: Test the latest release commit
     uses: ./.github/workflows/tests.yml
@@ -23,6 +42,7 @@ jobs:
     needs:
       - test
       - lint
+      - permission-check
     runs-on: ubuntu-latest
 
     steps:
@@ -43,7 +63,7 @@ jobs:
         run: python3 -m build
 
       - name: Store the distribution packages
-        uses: actions/upload-artifact@v3
+        uses: actions/upload-artifact@v4
         with:
           name: python-package-distributions
           path: dist/
@@ -63,7 +83,7 @@ jobs:
 
     steps:
       - name: Download all the dists
-        uses: actions/download-artifact@v3
+        uses: actions/download-artifact@v4
         with:
           name: python-package-distributions
           path: dist/
@@ -72,6 +92,8 @@ jobs:
         uses: pypa/gh-action-pypi-publish@release/v1
         with:
           repository-url: https://test.pypi.org/legacy/
+          verbose: true
+
 
       - name: Sleep for 2 minutes
         run: sleep 2m
@@ -113,11 +135,13 @@ jobs:
 
     steps:
      - name: Download all the dists
-        uses: actions/download-artifact@v3
+        uses: actions/download-artifact@v4
        with:
          name: python-package-distributions
          path: dist/
 
      - name: Publish distribution 📦 to PyPI
        uses: pypa/gh-action-pypi-publish@release/v1
+        with:
+          verbose: true
 
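For reference, the guard step above can be reproduced outside of Actions against the same REST endpoint, `GET /repos/{owner}/{repo}/collaborators/{username}/permission`. A minimal Python sketch, assuming a token with read access to the repository in a `GITHUB_TOKEN` environment variable; the repository slug and username are placeholders:

```python
import os

import requests  # third-party; pip install requests

REPO = "flairNLP/fundus"  # placeholder repository slug
ACTOR = "octocat"  # placeholder username

# Same endpoint the workflow step curls; the JSON response carries a
# "permission" field with one of: admin, write, read, none.
response = requests.get(
    f"https://api.github.com/repos/{REPO}/collaborators/{ACTOR}/permission",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    timeout=30,
)
response.raise_for_status()

if response.json().get("permission") != "admin":
    raise SystemExit(f"User {ACTOR} does not have admin rights.")
```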

.github/workflows/publisher_coverage.yaml

Lines changed: 13 additions & 7 deletions
@@ -2,13 +2,14 @@ name: Publisher Coverage
 
 on:
   schedule:
-    - cron: '0 1 * * *' # Runs at 01:00
+    - cron: '0 14 * * *' # Runs at 14:00
 
   workflow_dispatch:
 
 jobs:
   validate_crawlers:
     runs-on: ubuntu-latest
+    timeout-minutes: 30
 
     steps:
       - name: Set up Git repository
@@ -25,12 +26,16 @@ jobs:
         run: pip install -e .
 
       - name: Validate Crawlers
+        env:
+          PYTHONPATH: .
+        # Set up a timeout to avoid long-running tests
+        # We skip the Kicker APNews publishers, because they are IP-blocked
         run: |
           set -o pipefail
-          exec python scripts/publisher_coverage.py | tee publisher_coverage.txt
+          timeout 25m python -u scripts/publisher_coverage.py --skip Kicker APNews Tageblatt | tee publisher_coverage.txt
 
       - name: Upload Coverage Report
-        if: success() || failure()
+        if: always()
         uses: actions/upload-artifact@v4
         with:
           name: Publisher Coverage
@@ -61,12 +66,13 @@ jobs:
           echo "TOTAL_PUBLISHERS=$(echo ${{ env.SUCCESS_RATE }} | grep -P -o '\d+' | tail -1)" >> $GITHUB_ENV
           echo "PASSED_PUBLISHERS=$(echo ${{ env.SUCCESS_RATE }} | grep -P -o '\d+' | head -1)" >> $GITHUB_ENV
 
-      - name: Get Red Threshold
-        # We set the badge colour to red when at least one publisher failed the tests.
-        run: echo "RED_THRESHOLD=$(( ${{ env.TOTAL_PUBLISHERS }} - 1 ))" >> $GITHUB_ENV
+      - name: Get Thresholds
+        # We set the badge colour to red when at least half of the publishers failed the tests.
+        run: |
+          echo "RED_THRESHOLD=$(( ${{ env.TOTAL_PUBLISHERS }} / 2 ))" >> $GITHUB_ENV
 
       - name: Create Badge
-        uses: schneegans/dynamic-badges-action@v1.6.0
+        uses: schneegans/dynamic-badges-action@v1.7.0
         with:
           auth: ${{ secrets.DOBBERSC_GIST_SECRET }}
           gistID: ca0ae056b05cbfeaf30fa42f84ddf458
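The `Get Thresholds` step extracts the passed and total publisher counts from `env.SUCCESS_RATE` and turns the badge red once at least half of the publishers fail. A minimal sketch of the same arithmetic in Python, assuming a rate string such as `'182/200'` (the exact format written by `scripts/publisher_coverage.py` is not shown in this diff):

```python
import re


def parse_success_rate(success_rate: str) -> tuple:
    # Mirrors the two grep calls: the first number is passed, the last is total.
    numbers = re.findall(r"\d+", success_rate)
    return int(numbers[0]), int(numbers[-1])


passed, total = parse_success_rate("182/200")  # assumed input format
red_threshold = total // 2  # RED_THRESHOLD=$(( TOTAL_PUBLISHERS / 2 ))
print(passed, total, red_threshold)  # 182 200 100
```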

.github/workflows/tests.yml

Lines changed: 0 additions & 1 deletion
@@ -28,7 +28,6 @@ jobs:
         if: steps.cache.outputs.cache-hit != 'true'
         run: |
           pip install -e .[dev]
-
       - name: Run pytest
         run: python -m pytest -vv
 

README.md

Lines changed: 120 additions & 33 deletions
@@ -18,7 +18,7 @@ Developed at <a href="https://www.informatik.hu-berlin.de/en/forschung-en/gebiet
 <div align="center">
 <hr>
 
-[Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](/docs/supported_publishers.md) | [Paper](https://arxiv.org/abs/2403.15279)
+[Quick Start](#quick-start) | [Tutorials](#tutorials) | [News Sources](/docs/supported_publishers.md) | [Paper](https://aclanthology.org/2024.acl-demos.29/)
 
 </div>
 
@@ -68,24 +68,25 @@ That's already it!
 If you run this code, it should print out something like this:
 
 ```console
-Fundus-Article:
+Fundus-Article including 1 image(s):
 - Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
-- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
-  through committee votes on Thursday thanks to a last-minute [...]"
+- Text:  "89-year-old California senator arrived hour late to Judiciary Committee hearing
+  to advance President Biden's stalled nominations Democrats [...]"
 - URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
-- From: FreeBeacon (2023-05-11 18:41)
+- From: The Washington Free Beacon (2023-05-11 18:41)
 
-Fundus-Article:
+Fundus-Article including 3 image(s):
 - Title: "Northwestern student government freezes College Republicans funding over [...]"
 - Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
   the funds of the university's chapter of College Republicans [...]"
 - URL: https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
-- From: FoxNews (2023-05-09 14:37)
+- From: Fox News (2023-05-09 14:37)
 ```
 
 This printout tells you that you successfully crawled two articles!
 
 For each article, the printout details:
+- the number of images included in the article
 - the "Title" of the article, i.e. its headline
 - the "Text", i.e. the main article body text
 - the "URL" from which it was crawled
@@ -94,7 +95,7 @@ For each article, the printout details:
 
 ## Example 2: Crawl a specific news source
 
-Maybe you want to crawl a specific news source instead. Let's crawl news articles from Washington Times only:
+Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:
 
 ```python
 from fundus import PublisherCollection, Crawler
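The hunk cuts off inside Example 2's code block, so the crawler initialization is not shown. For orientation, a complete sketch of the updated example; the publisher attribute name is an assumption, check [docs/supported_publishers.md](/docs/supported_publishers.md) for the exact identifier:

```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for The New Yorker
# (attribute name assumed; see docs/supported_publishers.md)
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```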
@@ -107,21 +108,95 @@
 for article in crawler.crawl(max_articles=2):
     print(article)
 ```
 
-## Example 3: Crawl articles from CC-NEWS
+## Example 3: Crawl 1 Million articles
 
-If you're not familiar with CC-NEWS, check out their [paper](https://paperswithcode.com/dataset/cc-news).
+To crawl such a vast amount of data, Fundus relies on the `CommonCrawl` web archive, in particular the news crawl `CC-NEWS`.
+If you're not familiar with [`CommonCrawl`](https://commoncrawl.org/) or [`CC-NEWS`](https://commoncrawl.org/blog/news-dataset-available) check out their websites.
+Simply import our `CCNewsCrawler` and make sure to check out our [tutorial](docs/2_crawl_from_cc_news.md) beforehand.
 
 ````python
 from fundus import PublisherCollection, CCNewsCrawler
 
-# initialize the crawler for news publishers based in the US
-crawler = CCNewsCrawler(*PublisherCollection.us)
+# initialize the crawler using all publishers supported by fundus
+crawler = CCNewsCrawler(*PublisherCollection)
 
-# crawl 2 articles and print
-for article in crawler.crawl(max_articles=2):
+# crawl 1 million articles and print
+for article in crawler.crawl(max_articles=1000000):
     print(article)
 ````
 
+**_Note_**: By default, the crawler utilizes all available CPU cores on your system.
+For optimal performance, we recommend manually setting the number of processes using the `processes` parameter.
+A good rule of thumb is to allocate `one process per 200 Mbps of bandwidth`.
+This can vary depending on core speed.
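Applying that rule of thumb, a 1000 Mbps connection maps to roughly five worker processes. A short sketch using the `processes` parameter named in the note above; the keyword usage is an assumption, see the CC-NEWS tutorial for the exact signature:

```python
from fundus import PublisherCollection, CCNewsCrawler

# ~1000 Mbps of bandwidth / 200 Mbps per process -> 5 processes (assumed keyword)
crawler = CCNewsCrawler(*PublisherCollection, processes=5)

for article in crawler.crawl(max_articles=1000):
    print(article)
```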
+
+**_Note_**: The crawl above took ~7 hours using the entire `PublisherCollection` on a machine with 1000 Mbps connection, Core i9-13905H, 64GB Ram, Windows 11 and without printing the articles.
+The estimated time can vary substantially depending on the publisher used and the available bandwidth.
+Additionally, not all publishers are included in the `CC-NEWS` crawl (especially US based publishers).
+For large corpus creation, one can also use the regular crawler by utilizing only sitemaps, which requires significantly less bandwidth.
+
+````python
+from fundus import PublisherCollection, Crawler, Sitemap
+
+# initialize a crawler for us/uk based publishers and restrict to Sitemaps only
+crawler = Crawler(PublisherCollection.us, PublisherCollection.uk, restrict_sources_to=[Sitemap])
+
+# crawl 1 million articles and print
+for article in crawler.crawl(max_articles=1000000):
+    print(article)
+````
+
+
+## Example 4: Crawl some images
+
+By default, Fundus tries to parse the images included in every crawled article.
+Let's crawl an article and print out the images for some more details.
+
+```python
+from fundus import PublisherCollection, Crawler
+
+# initialize the crawler for The LA Times
+crawler = Crawler(PublisherCollection.us.LATimes)
+
+# crawl 1 article and print the images
+for article in crawler.crawl(max_articles=1):
+    for image in article.images:
+        print(image)
+```
+
+For [this article](https://www.latimes.com/sports/lakers/story/2024-12-13/lakers-lebron-james-away-from-team-timberwolves) you will get the following output:
+
+```console
+Fundus-Article Cover-Image:
+  -URL: 'https://ca-times.brightspotcdn.com/dims4/default/41c9bc4/2147483647/strip/true/crop/4598x3065+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F77%2Feb%2F7fed2d3942fd97b0f7325e7060cf%2Flakers-timberwolves-basketball-33765.jpg'
+  -Description: 'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'
+  -Caption: 'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'
+  -Authors: ['Abbie Parr / Associated Press']
+  -Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]
+
+Fundus-Article Image:
+  -URL: 'https://ca-times.brightspotcdn.com/dims4/default/9a22715/2147483647/strip/true/crop/4706x3137+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Ff7%2F52%2Fdcd6b263480ab579ac583a4fdbbf%2Flakers-timberwolves-basketball-48004.jpg'
+  -Description: 'Lakers coach JJ Redick talks with forward Anthony Davis during a loss to the Timberwolves.'
+  -Caption: 'Lakers coach JJ Redick, right, talks with forward Anthony Davis during the first half of a 97-87 loss to the Timberwolves on Friday night.'
+  -Authors: ['Abbie Parr / Associated Press']
+  -Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]
+
+Fundus-Article Image:
+  -URL: 'https://ca-times.brightspotcdn.com/dims4/default/580bae4/2147483647/strip/true/crop/5093x3470+0+0/resize/1200x818!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F3b%2Fdf%2F64c0198b4c2fb2b5824aaccb64b7%2F1486148-sp-nba-lakers-trailblazers-25-gmf.jpg'
+  -Description: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'
+  -Caption: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com Arena on Dec. 8.'
+  -Authors: ['Gina Ferazzi / Los Angeles Times']
+  -Versions: [320x218, 568x387, 768x524, 1024x698, 1200x818]
+```
+
+For each image, the printout details:
+- The cover image designation (if applicable).
+- The URL for the highest-resolution version of the image.
+- A description of the image.
+- The image's caption.
+- The name of the copyright holder.
+- A list of all available versions of the image.
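The fields in this printout suggest per-image accessors next to `article.images`. A hedged sketch of reading them programmatically; the attribute names are assumptions inferred from the printed fields, not confirmed by this diff:

```python
from fundus import PublisherCollection, Crawler

crawler = Crawler(PublisherCollection.us.LATimes)

for article in crawler.crawl(max_articles=1):
    for image in article.images:
        # attribute names (url, caption, authors) are assumed
        # from the printed fields above
        print(image.url, image.caption, image.authors)
```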
+
 
 ## Tutorials
 
@@ -131,7 +206,8 @@ We provide **quick tutorials** to get you started with the library:
 2. [**Tutorial 2: How to crawl articles from CC-NEWS**](docs/2_crawl_from_cc_news.md)
 3. [**Tutorial 3: The Article Class**](docs/3_the_article_class.md)
 4. [**Tutorial 4: How to filter articles**](docs/4_how_to_filter_articles.md)
-5. [**Tutorial 5: How to search for publishers**](docs/5_how_to_search_for_publishers.md)
+5. [**Tutorial 5: Advanced topics**](docs/5_advanced_topics.md)
+6. [**Tutorial 6: Logging**](docs/6_logging.md)
 
 If you wish to contribute check out these tutorials:
 1. [**How to contribute**](docs/how_to_contribute.md)
@@ -143,32 +219,43 @@ You can find the publishers currently supported [**here**](/docs/supported_publi
 
 Also: **Adding a new publisher is easy - consider contributing to the project!**
 
-## Evaluation benchmark
+## Evaluation Benchmark
 
 Check out our evaluation [benchmark](https://github.com/dobbersc/fundus-evaluation).
 
-| **Scraper** | **Precision** | **Recall** | **F1-Score** |
-|-------------|---------------------------|---------------------------|---------------------------|
-| [Fundus](https://github.com/flairNLP/fundus) | **99.89**<sub>±0.57</sub> | 96.75<sub>±12.75</sub> | **97.69**<sub>±9.75</sub> |
-| [Trafilatura](https://github.com/adbar/trafilatura) | 90.54<sub>±18.86</sub> | 93.23<sub>±23.81</sub> | 89.81<sub>±23.69</sub> |
-| [BTE](https://github.com/dobbersc/fundus-evaluation/blob/master/src/fundus_evaluation/scrapers/bte.py) | 81.09<sub>±19.41</sub> | **98.23**<sub>±8.61</sub> | 87.14<sub>±15.48</sub> |
-| [jusText](https://github.com/miso-belica/jusText) | 86.51<sub>±18.92</sub> | 90.23<sub>±20.61</sub> | 86.96<sub>±19.76</sub> |
-| [news-please](https://github.com/fhamborg/news-please) | 92.26<sub>±12.40</sub> | 86.38<sub>±27.59</sub> | 85.81<sub>±23.29</sub> |
-| [BoilerNet](https://github.com/dobbersc/fundus-evaluation/tree/master/src/fundus_evaluation/scrapers/boilernet) | 84.73<sub>±20.82</sub> | 90.66<sub>±21.05</sub> | 85.77<sub>±20.28</sub> |
-| [Boilerpipe](https://github.com/kohlschutter/boilerpipe) | 82.89<sub>±20.65</sub> | 82.11<sub>±29.99</sub> | 79.90<sub>±25.86</sub> |
+The following table summarizes the overall performance of Fundus and evaluated scrapers in terms of averaged ROUGE-LSum precision, recall and F1-score and their standard deviation. The table is sorted in descending order over the F1-score:
+
+| **Scraper** | **Precision** | **Recall** | **F1-Score** | **Version** |
+|-------------|:--------------|------------|--------------|-------------|
+| [Fundus](https://github.com/flairNLP/fundus) | **99.89**<sub>±0.57</sub> | 96.75<sub>±12.75</sub> | **97.69**<sub>±9.75</sub> | 0.4.1 |
+| [Trafilatura](https://github.com/adbar/trafilatura) | 93.91<sub>±12.89</sub> | 96.85<sub>±15.69</sub> | 93.62<sub>±16.73</sub> | 1.12.0 |
+| [news-please](https://github.com/fhamborg/news-please) | 97.95<sub>±10.08</sub> | 91.89<sub>±16.15</sub> | 93.39<sub>±14.52</sub> | 1.6.13 |
+| [BTE](https://github.com/dobbersc/fundus-evaluation/blob/master/src/fundus_evaluation/scrapers/bte.py) | 81.09<sub>±19.41</sub> | **98.23**<sub>±8.61</sub> | 87.14<sub>±15.48</sub> | / |
+| [jusText](https://github.com/miso-belica/jusText) | 86.51<sub>±18.92</sub> | 90.23<sub>±20.61</sub> | 86.96<sub>±19.76</sub> | 3.0.1 |
+| [BoilerNet](https://github.com/dobbersc/fundus-evaluation/tree/master/src/fundus_evaluation/scrapers/boilernet) | 85.96<sub>±18.55</sub> | 91.21<sub>±19.15</sub> | 86.52<sub>±18.03</sub> | / |
+| [Boilerpipe](https://github.com/kohlschutter/boilerpipe) | 82.89<sub>±20.65</sub> | 82.11<sub>±29.99</sub> | 79.90<sub>±25.86</sub> | 1.3.0 |
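The scores above are ROUGE-LSum values from the linked benchmark. For orientation, a minimal sketch of computing such a score with the `rouge-score` package; whether the benchmark uses exactly these settings (e.g. stemming) is an assumption:

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rougeLsum"], use_stemmer=True)

reference = "The gold-standard article body text."
extraction = "The article body text produced by a scraper."

# score() returns a dict mapping rouge type to (precision, recall, fmeasure)
score = scorer.score(reference, extraction)["rougeLsum"]
print(score.precision, score.recall, score.fmeasure)
```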
 
 ## Cite
 
-Please cite the following [paper](https://arxiv.org/abs/2403.15279) when using Fundus or building upon our work:
+Please cite the following [paper](https://aclanthology.org/2024.acl-demos.29/) when using Fundus or building upon our work:
 
 ```bibtex
-@misc{dallabetta2024fundus,
-    title={Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions},
-    author={Max Dallabetta and Conrad Dobberstein and Adrian Breiding and Alan Akbik},
-    year={2024},
-    eprint={2403.15279},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
+@inproceedings{dallabetta-etal-2024-fundus,
+    title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
+    author = "Dallabetta, Max and
+      Dobberstein, Conrad and
+      Breiding, Adrian and
+      Akbik, Alan",
+    editor = "Cao, Yixin and
+      Feng, Yang and
+      Xiong, Deyi",
+    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
+    month = aug,
+    year = "2024",
+    address = "Bangkok, Thailand",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2024.acl-demos.29",
+    pages = "305--314",
 }
 ```
 

docs/1_getting_started.md

Lines changed: 15 additions & 0 deletions
@@ -4,6 +4,7 @@
 * [What is the `PublisherCollection`](#what-is-the-publishercollection)
 * [What is a `Crawler`](#what-is-a-crawler)
 * [How to crawl articles](#how-to-crawl-articles)
+* [Saving crawled articles](#saving-crawled-articles)
 
 # Basics
 
@@ -83,5 +84,19 @@
 for article in crawler.crawl():
     print(article)
 ````
 
+Additionally, you can set a timeout for the crawler in seconds.
+If the crawler does not receive a new article within the specified timeout period, it will terminate automatically.
+```` python
+for article in crawler.crawl(timeout=10):
+    print(article)
+````
+This is especially useful when working with date-related article filters.
+Refer to [this section](4_how_to_filter_articles.md) to learn more about how to filter articles.
+
+# Saving crawled articles
+
+To save all crawled articles to a file use the `save_to_file` parameter of the `crawl` method.
+When given a path, the crawled articles will be saved as a JSON list using the
+[default article serialization](3_the_article_class.md#saving-an-article) and `UTF-8` encoding.
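Combining the two additions documented above, a short sketch; the file name is illustrative, and whether `crawl` still yields articles while saving is an assumption:

```python
from fundus import PublisherCollection, Crawler

crawler = Crawler(PublisherCollection.us)

# stop after 10 idle seconds and write the crawled articles to a JSON list
for article in crawler.crawl(timeout=10, save_to_file="articles.json"):
    print(article)
```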
 
 In the [next](2_crawl_from_cc_news.md) section we will show you how to crawl articles from the CC-NEWS dataset.
