Conversation

@addie9800
Collaborator

No description provided.

def load_html_test_file_mapping(publisher: Publisher) -> Dict[Type[BaseParser], HTMLTestFile]:
    html_paths = (test_resource_path / Path(f"{publisher.__group__.__name__.lower()}")).glob(
-        f"{publisher.__name__}*.html.gz"
+        f"{publisher.__name__}_*.html.gz"
Collaborator

Is this intentional?

Collaborator Author

Yes, this is intentional. We now have two newspapers where the name of one is a prefix of the other: Isolezwe and IsolezweLesiXhosa. When running pytest for Isolezwe, the old glob pattern also selected the HTML test file for IsolezweLesiXhosa, causing the test to fail.
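
For illustration, here is a minimal sketch of the name collision using only the standard library. The file names are hypothetical (the real test resources may be named differently), and `fnmatchcase` stands in for the matching that `Path.glob` performs on file names.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical test resource names; only the publisher prefixes come from the PR.
files = [
    "Isolezwe_2025_11_13.html.gz",
    "IsolezweLesiXhosa_2025_11_13.html.gz",
]


def select(pattern: str) -> List[str]:
    """Return the file names that match a glob-style pattern (case-sensitive)."""
    return [name for name in files if fnmatchcase(name, pattern)]


# Old pattern: "*" directly after the publisher name also matches the longer name.
old_matches = select("Isolezwe*.html.gz")

# New pattern: the underscore must directly follow the publisher name.
new_matches = select("Isolezwe_*.html.gz")

print(old_matches)
print(new_matches)
```

With the underscore in place, the pattern for Isolezwe no longer picks up files that belong to IsolezweLesiXhosa.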

Comment on lines 65 to 69
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/iindaba/", languages={"xh"}),
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/ezemidlalo/", languages={"xh"}),
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/ezoyolo/", languages={"xh"}),
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/izimvo/", languages={"xh"}),
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/entsimini/", languages={"xh"}),
Collaborator

Can't we use the same trick here as with the publishers above?

Also, I saw some English-language sitemaps. Maybe it would be good to add them? They could be beneficial for cross-lingual corpora. What do you think?

Collaborator Author

@addie9800 addie9800 Nov 13, 2025

Actually, now that I think about it, yes. I tried to find a single regex that selects all Xhosa sitemaps but not the English ones, and I struggled to find one. But I realised that I can use two separate regexes and AND them together. I'll update that.
Unfortunately, the English sitemaps are empty :/ But usually I would definitely agree.
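
A rough sketch of the "two regexes combined with AND" idea, using only the standard `re` module. Both patterns, the `/english/` marker, and the second URL are assumptions made up for illustration; they are not the expressions or sitemaps from this PR, and the project's own filter helpers are deliberately not used here.

```python
import re

# Illustrative assumptions, not the patterns used in the PR.
IS_LESIXHOSA_SECTION = re.compile(r"/sitemap/isolezwe-lesixhosa/")
IS_ENGLISH = re.compile(r"/english/")  # hypothetical marker for English sitemaps


def keep_sitemap(url: str) -> bool:
    """Keep a URL only if it is in the Xhosa section AND is not marked as English."""
    in_section = IS_LESIXHOSA_SECTION.search(url) is not None
    is_english = IS_ENGLISH.search(url) is not None
    return in_section and not is_english


urls = [
    "https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/iindaba/",
    # Hypothetical English sitemap URL, invented for this example.
    "https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/english/news/",
]

kept = [url for url in urls if keep_sitemap(url)]
print(kept)
```

Splitting the condition into two simple patterns avoids a single expression with lookaheads and keeps each rule easy to change on its own.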

3 participants