Add some Independent Online newspapers #827

addie9800 · 2025-11-09T00:09:15Z

No description provided.

MaxDall · 2025-11-10T23:17:10Z

tests/utility.py

 def load_html_test_file_mapping(publisher: Publisher) -> Dict[Type[BaseParser], HTMLTestFile]:
    html_paths = (test_resource_path / Path(f"{publisher.__group__.__name__.lower()}")).glob(
-        f"{publisher.__name__}*.html.gz"
+        f"{publisher.__name__}_*.html.gz"


Is this intentional?

addie9800 · 2025-11-13

Yes, this is intentional. We now have two newspapers where the name of one is a prefix of the other: `Isolezwe` and `IsolezweLesiXhosa`. When running pytest for `Isolezwe`, the old glob would also select the HTML test files of the other newspaper, causing the test to fail.
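To illustrate the collision, here is a minimal sketch using `fnmatch.fnmatchcase`, which applies shell-style wildcard rules comparable to the glob in the diff. The file names and their date suffixes are hypothetical stand-ins, not the actual test resources.

```python
from fnmatch import fnmatchcase

# Hypothetical test-resource names; the date suffix is an assumption.
files = [
    "Isolezwe_2025_11_09.html.gz",
    "IsolezweLesiXhosa_2025_11_09.html.gz",
]


def select(pattern: str, names: list[str]) -> list[str]:
    """Return the names matching a shell-style pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


# Old pattern: the bare prefix also matches the longer publisher name.
old_matches = select("Isolezwe*.html.gz", files)

# New pattern: the underscore must follow the publisher name directly.
new_matches = select("Isolezwe_*.html.gz", files)

print(old_matches)
print(new_matches)
```

With the underscore in place, only files whose name continues with `_` right after the publisher name are selected, so the `IsolezweLesiXhosa` files are no longer picked up when testing `Isolezwe`.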

MaxDall · 2025-11-10T23:24:45Z

src/fundus/publishers/za/__init__.py

+            Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/iindaba/", languages={"xh"}),
+            Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/ezemidlalo/", languages={"xh"}),
+            Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/ezoyolo/", languages={"xh"}),
+            Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/izimvo/", languages={"xh"}),
+            Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/entsimini/", languages={"xh"}),


Can't we do the same trick here as with the publishers above?

Furthermore, I saw some English-language sitemaps. Maybe it would be good to add them? They could be beneficial for cross-lingual corpora. What do you think?

addie9800 · 2025-11-13 (edited)

Actually, now that I think about it, yes. I tried to find a single regex that selects all Xhosa sitemaps but not the English ones, and struggled to come up with one. Then I realised that I can use two separate filters and AND them together. I'll update that.
Unfortunately, the English sitemaps are empty :/ Otherwise I would definitely agree.
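The idea of combining two simpler conditions instead of one complicated regex could look roughly like the sketch below. It is written with plain `re` and deliberately does not use the Fundus filter API; the helper names (`matches`, `negate`, `all_of`) and the English section names are hypothetical, chosen only for illustration.

```python
import re
from typing import Callable

UrlPredicate = Callable[[str], bool]


def matches(pattern: str) -> UrlPredicate:
    """Build a predicate that is True when the pattern occurs in the URL."""
    compiled = re.compile(pattern)
    return lambda url: compiled.search(url) is not None


def negate(predicate: UrlPredicate) -> UrlPredicate:
    """Invert a predicate."""
    return lambda url: not predicate(url)


def all_of(*predicates: UrlPredicate) -> UrlPredicate:
    """Logical AND: True only if every predicate accepts the URL."""
    return lambda url: all(p(url) for p in predicates)


# Condition 1: the URL is a section sitemap of this publisher.
is_section_sitemap = matches(r"/sitemap/isolezwe-lesixhosa/[^/]+/$")

# Condition 2: the section is not an English one.
# Assumption: "news" and "sport" are made-up English section names.
is_not_english = negate(matches(r"/(news|sport)/$"))

keep_xhosa_sitemap = all_of(is_section_sitemap, is_not_english)

base = "https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/"
print(keep_xhosa_sitemap(base + "iindaba/"))
print(keep_xhosa_sitemap(base + "news/"))
```

Each condition stays short and readable on its own, which is usually easier to maintain than a single regex with negative lookaheads.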

addie9800 added 3 commits November 9, 2025 01:09

Add some Independent Online newspapers

f8111eb

formatting

d00e31a

fix tests

772b0f4

MaxDall requested changes Nov 10, 2025

add sitemap_filter

c892817
