Add some Independent Online newspapers #827
def load_html_test_file_mapping(publisher: Publisher) -> Dict[Type[BaseParser], HTMLTestFile]:
    html_paths = (test_resource_path / Path(f"{publisher.__group__.__name__.lower()}")).glob(
-       f"{publisher.__name__}*.html.gz"
+       f"{publisher.__name__}_*.html.gz"
Is this intentional?
Yes, this is intentional. We now have two newspapers where the name of one is a prefix of the other: Isolezwe and IsolezweLesiXhosa. When running pytest for Isolezwe, the old pattern would also select the HTML test file of the other newspaper, causing the test to fail.
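The collision described above can be reproduced with a minimal sketch. It uses `fnmatch`-style matching, which is assumed here to behave like `Path.glob` for a single path component. The file names are hypothetical and only illustrate the `<PublisherName>_...` convention implied by the new pattern.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical test-resource file names; the real naming scheme may differ.
files = [
    "Isolezwe_2024_01_01.html.gz",
    "IsolezweLesiXhosa_2024_01_01.html.gz",
]


def select(pattern: str) -> List[str]:
    """Return every file name that matches the glob-style pattern."""
    return [name for name in files if fnmatchcase(name, pattern)]


# Old pattern: the publisher name is only a prefix, so both publishers match.
old_matches = select("Isolezwe*.html.gz")

# New pattern: the underscore must directly follow the publisher name.
new_matches = select("Isolezwe_*.html.gz")

print(old_matches)
print(new_matches)
```

With the trailing underscore, only the files of the publisher whose name matches exactly are selected.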
src/fundus/publishers/za/__init__.py
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/iindaba/", languages={"xh"}),
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/ezemidlalo/", languages={"xh"}),
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/ezoyolo/", languages={"xh"}),
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/izimvo/", languages={"xh"}),
Sitemap("https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/entsimini/", languages={"xh"}),
Can't we do the same trick here as with the publishers above?
Also, I saw some English-language sitemaps. Maybe it would be good to add them? They could be beneficial for cross-lingual corpora. What do you think?
Actually, now that I think about it, yes. I tried to find a single regex that selects all Xhosa sitemaps but not the English ones, and I struggled with that. But I realised that I can use two separate ones and AND them together. I'll update that.
Unfortunately, the English sitemaps are empty :/ But usually I would definitely agree.
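As a rough illustration of combining two separate patterns with a logical AND, here is a minimal sketch in plain Python. The helpers `matches`, `negate`, and `all_of` are hypothetical and are not the Fundus API. The English section names in the exclusion pattern and the third URL are assumptions made up for the example.

```python
import re
from typing import Callable, List

UrlPredicate = Callable[[str], bool]


def matches(pattern: str) -> UrlPredicate:
    """Hypothetical helper: true if the regex is found anywhere in the URL."""
    compiled = re.compile(pattern)
    return lambda url: compiled.search(url) is not None


def negate(predicate: UrlPredicate) -> UrlPredicate:
    """Hypothetical helper: invert a predicate."""
    return lambda url: not predicate(url)


def all_of(*predicates: UrlPredicate) -> UrlPredicate:
    """Hypothetical helper: logical AND over several predicates."""
    return lambda url: all(predicate(url) for predicate in predicates)


urls: List[str] = [
    "https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/iindaba/",
    "https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/ezemidlalo/",
    # Assumed English section, invented for this example only.
    "https://isolezwelesixhosa.co.za/sitemap/isolezwe-lesixhosa/news/",
]

is_xhosa_sitemap = all_of(
    # First pattern: the URL belongs to the publisher's sitemap tree.
    matches(r"/sitemap/isolezwe-lesixhosa/"),
    # Second pattern: the URL is not one of the (assumed) English sections.
    negate(matches(r"/(news|sport)/$")),
)

selected = [url for url in urls if is_xhosa_sitemap(url)]
print(selected)
```

Two simple patterns joined this way are usually easier to read and maintain than one regex that has to include and exclude at the same time.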