This printout tells you that you successfully crawled two articles!
For each article, the printout details:
- the number of images included in the article
- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body text
- the "URL" from which it was crawled
## Example 2: Crawl a specific news source
Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:
```python
from fundus import PublisherCollection, Crawler
# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```
## Example 3: Crawl 1 million articles
To crawl such a vast amount of data, Fundus relies on the `CommonCrawl` web archive, in particular the news crawl `CC-NEWS`.
If you're not familiar with [`CommonCrawl`](https://commoncrawl.org/) or [`CC-NEWS`](https://commoncrawl.org/blog/news-dataset-available), check out their websites.
Simply import our `CCNewsCrawler` and make sure to check out our [tutorial](docs/2_crawl_from_cc_news.md) beforehand.
````python
from fundus import PublisherCollection, CCNewsCrawler
# initialize the crawler using all publishers supported by fundus
crawler = CCNewsCrawler(*PublisherCollection)
# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)
````
**_Note_**: By default, the crawler utilizes all available CPU cores on your system.
For optimal performance, we recommend manually setting the number of processes using the `processes` parameter.
A good rule of thumb is to allocate `one process per 200 Mbps of bandwidth`.
This can vary depending on core speed.
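As a quick sketch of this rule of thumb (the helper function is illustrative and not part of Fundus; only the `processes` parameter is):

```python
# Sketch of the rule of thumb above: one process per 200 Mbps of bandwidth.
# `processes_for_bandwidth` is an illustrative helper, not part of Fundus.
def processes_for_bandwidth(bandwidth_mbps: int) -> int:
    return max(1, bandwidth_mbps // 200)

# A 1000 Mbps connection would then suggest 5 processes, e.g.:
# crawler = CCNewsCrawler(*PublisherCollection, processes=processes_for_bandwidth(1000))
print(processes_for_bandwidth(1000))  # 5
```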
**_Note_**: The crawl above took ~7 hours using the entire `PublisherCollection` on a machine with 1000 Mbps connection, Core i9-13905H, 64GB Ram, Windows 11 and without printing the articles.
The estimated time can vary substantially depending on the publisher used and the available bandwidth.
Additionally, not all publishers are included in the `CC-NEWS` crawl (especially US-based publishers).
For large corpus creation, one can also use the regular crawler by utilizing only sitemaps, which requires significantly less bandwidth.
````python
from fundus import PublisherCollection, Crawler, Sitemap
# initialize a crawler for us/uk based publishers and restrict to Sitemaps only
crawler = Crawler(PublisherCollection.us, PublisherCollection.uk, restrict_sources_to=[Sitemap])

# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)
````
## Example 4: Crawl some images
By default, Fundus tries to parse the images included in every crawled article.
Let's crawl an article and print out the images for some more details.
154
+
155
+
```python
from fundus import PublisherCollection, Crawler
# initialize the crawler for The LA Times
crawler = Crawler(PublisherCollection.us.LATimes)
# crawl 1 article and print the images
for article in crawler.crawl(max_articles=1):
    for image in article.images:
        print(image)
```
For [this article](https://www.latimes.com/sports/lakers/story/2024-12-13/lakers-lebron-james-away-from-team-timberwolves) you will get the following output:
```
...
-Description:	'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'
-Caption:	'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'
...
-Description:	'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'
-Caption:	'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com Arena on Dec. 8.'
...
```
The following table summarizes the overall performance of Fundus and evaluated scrapers in terms of averaged ROUGE-LSum precision, recall and F1-score and their standard deviation. The table is sorted in descending order over the F1-score:
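For reference, the F1-score shown in the table is the harmonic mean of precision and recall; a minimal sketch of how the two combine:

```python
# F1 as the harmonic mean of precision and recall
# (how the ROUGE-LSum precision and recall columns combine).
def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.9, 0.8))  # ~0.847
```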