- Scrape base_sitemaps
  - read XML (using sitemap_versions)
  - updated_sitemaps <- filter for sitemaps not in archive (lastmod > lastscrape)
  - amend archived base_sitemaps with updated_sitemaps
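The sitemap-index step above could look roughly like this in Python (the notes themselves use R-style pseudocode, so this is illustrative only); `parse_sitemap_index` and `updated_sitemaps` are hypothetical names, and the XML layout follows the standard sitemap protocol:

```python
from datetime import datetime, timezone
from xml.etree import ElementTree

# Namespace defined by the sitemap protocol (sitemaps.org).
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap_index(xml_text):
    """Return (loc, lastmod) pairs from a sitemap index document."""
    root = ElementTree.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:sitemap", NS):
        loc = node.findtext("sm:loc", namespaces=NS)
        lastmod = node.findtext("sm:lastmod", namespaces=NS)
        entries.append((loc, datetime.fromisoformat(lastmod)))
    return entries

def updated_sitemaps(entries, last_scrape):
    """Keep only sitemaps modified since the previous scrape run."""
    return [(loc, mod) for loc, mod in entries if mod > last_scrape]
```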
- Scrape updated_sitemaps
  1. read XML (using updated_sitemaps$loc)
  2. new_articles <- filter for URLs not in archive (lastmod > lastscrape)
     - compare to archived sitemaps or to archived articles$link?
     - handle redirects?
     - handle errors?
  - amend archive with new_articles
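A minimal sketch of the new_articles filter, assuming the archive exposes its already-scraped links as a list and that redirects have been resolved into a hypothetical `redirect_map` (old URL -> final URL) so a redirected URL is not scraped twice:

```python
def new_articles(sitemap_urls, archived_links, redirect_map=None):
    """Return sitemap URLs that are not yet in the archive."""
    redirect_map = redirect_map or {}
    archived = set(archived_links)
    fresh = []
    for url in sitemap_urls:
        # Compare both the raw URL and its redirect target to the archive.
        final = redirect_map.get(url, url)
        if url not in archived and final not in archived:
            fresh.append(url)
    return fresh
```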
- Scrape new_articles
  - select the version-specific scraping function
    - DE = ESP = RS
      - RS: scrape twice, once for each version?
    - EN ~ FR ~ RUS
    - Arabic: more difficult (DDoS guard, specific format)
  - amend meta_archive (incl. text) with every article scrape
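The version-specific dispatch could be a simple lookup table; the scraper functions and version codes below are placeholders for the real per-language parsers:

```python
# Hypothetical per-version scraper registry: DE/ESP/RS share one layout,
# EN/FR/RUS share a similar one, and Arabic needs its own handler.
def scrape_default(html):
    return {"layout": "default", "html": html}

def scrape_en_like(html):
    return {"layout": "en_like", "html": html}

def scrape_arabic(html):
    return {"layout": "arabic", "html": html}

SCRAPERS = {
    "DE": scrape_default, "ESP": scrape_default, "RS": scrape_default,
    "EN": scrape_en_like, "FR": scrape_en_like, "RUS": scrape_en_like,
    "AR": scrape_arabic,
}

def scrape_article(version, html):
    """Route an article's HTML to the scraper for its site version."""
    return SCRAPERS[version](html)
```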
::: {style="color: red"}
- add a hash derived from the link, linking meta_archive to media_archive
- Either:
  - split up media_archive into multiple folders ordered by sub-hash
    (i.e. split by version-year pair (cf. sitemap) or version-month pair)
  - or name media files {hash_no}
:::
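One possible shape for the hash link between meta_archive and media_archive: a SHA-1 of the article link, with a two-character sub-hash prefix as the shard folder. The shard scheme is an assumption (the notes also consider version-year or version-month pairs):

```python
import hashlib

def media_hash(link):
    """Stable hash linking a meta_archive row to its media files."""
    return hashlib.sha1(link.encode("utf-8")).hexdigest()

def media_path(link, shard_len=2):
    """Shard media_archive into folders by a sub-hash prefix."""
    h = media_hash(link)
    return f"media_archive/{h[:shard_len]}/{h}"
```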
- Either:
  - (amend linklist etc. with id/hash and a (horizontal) vector of links), or
  - amend yt_list, twitter_list, linklist, ... with article_id/link pairs
- include a reference_count for each link in yt/twitter/link_list
  - if internal link: write the reference_count into meta_archive as internally_referenced
    - if not yet in meta_archive: amend meta_archive with a note
      - try to scrape it in the next run
      - if it no longer exists: look it up on archive.org
  - else (if external link):
    - if available: save the full page, including media (using archive.org servers?)
    - else: check whether it is on archive.org (earliest version)
      - save the archive.org link
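The reference_count bookkeeping sketched above, assuming article_id/link pairs and that internal links are recognizable by a URL prefix (both assumptions, not from the notes):

```python
from collections import Counter

def reference_counts(article_link_pairs):
    """Count how often each link is referenced across all articles."""
    return Counter(link for _, link in article_link_pairs)

def internal_reference_counts(counts, internal_prefix):
    """Split out internal links, to be written back to meta_archive
    as internally_referenced."""
    return {l: n for l, n in counts.items() if l.startswith(internal_prefix)}
```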
- Check that all new_articles are in the scrape & check for DDoS-guard pages (unless a previous error note exists)
  - log with every loop
  - run again for the missing ones
    - if they no longer exist: look them up at archive.org
    - add a note
  - log again
  - if there is an unusually high number of missings (or missings in specific fields): add a note
    - potentially rescrape those
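A sketch of the retry loop with per-round logging; `fetch` and `is_ddos_page` stand in for the real HTTP and DDoS-guard-detection code:

```python
def scrape_with_retries(urls, fetch, is_ddos_page, max_rounds=2):
    """Retry missing or DDoS-guarded URLs, logging each round.

    Returns (results, still_missing, log); log holds one
    (round_number, missing_count) entry per loop.
    """
    log, missing, results = [], list(urls), {}
    for round_no in range(max_rounds):
        still_missing = []
        for url in missing:
            page = fetch(url)
            if page is None or is_ddos_page(page):
                still_missing.append(url)
            else:
                results[url] = page
        log.append((round_no, len(still_missing)))
        missing = still_missing
        if not missing:
            break
    return results, missing, log
```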
- Create a db query method/API
  - if including the article list: split it into multiple variables
  - (if including media: create a zipped folder)
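A minimal in-memory version of such a query method, assuming meta_archive rows are records with version and lastmod fields (a real implementation would query the database):

```python
def query_archive(meta_archive, version=None, since=None):
    """Filter meta_archive rows (assumed: list of dicts with
    'version' and 'lastmod' keys) by version and date."""
    rows = meta_archive
    if version is not None:
        rows = [r for r in rows if r["version"] == version]
    if since is not None:
        rows = [r for r in rows if r["lastmod"] >= since]
    return rows
```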