[WIP] Add publisher pravda (Ukrainska Pravda) #807

bucheben · 2025-10-23T12:13:27Z

WIP, because I require external input to complete this PR.

Ukrainska Pravda is a news site which publishes most articles in Ukrainian, English, and Russian.
The sitemap directly gives me the page of the NewsMap. There, the versions of the same article in the different languages are grouped together and distinguished by the hreflang attribute. Does Fundus support this? Or should I simply select a single language for now?

The primary issue for me right now is how to extract the article body, as this is the part on which the tutorial focuses the least.

This English article for instance has the entire article body (only!) in the easily accessible precomputed.ld data.
Meanwhile, this Ukrainian article spreads out the article over many <p> tags.
This Ukrainian finance article suddenly places the author somewhere completely different.

Was I unfortunate with my choice of news site, or am I missing too much domain knowledge in DOM navigation?

MaxDall · 2025-10-24T09:26:01Z

@bucheben Thanks so much for your work so far!

> Was I just unlucky with my choice of news site, or am I missing some key knowledge about DOM navigation?

Regarding the sitemap: you definitely picked a challenging one 😅, but I’m glad you did! It actually seems like a great opportunity to extend Fundus with some useful new functionality, so thank you for choosing this outlet.

For now, I’d suggest continuing to use the existing Fundus implementation, as you’ve already done, and simply include the sitemap as well.

As for the article body: after looking through a few examples, it seems that despite the different languages, the articles all share the same layout. For example, in all three cases, the paragraphs can be selected using this CSS selector:

div.post_news_text > p

> This Ukrainian finance article suddenly places the author somewhere completely different.

If possible, try to extract the author information from the article’s metadata — for example, from the ld or meta tags located in the precomputed section.

Let me know if that resolves your issue or if there’s anything else I can help with!

bucheben · 2025-10-24T13:28:42Z

Thanks for the quick response!
Your CSSSelector works nicely. I'm not sure what I did wrong when I concluded that the article body was only in the ld section of that one article.

The retrieval of the author is odd, to say the least. For instance, the two most recent articles are the same one in Ukrainian and Russian.

Russian article `self.precomputed.ld.__dict__`

{'NewsArticle': {'@context': 'http://schema.org', 'name': 'Египет стал главным покупателем украинского зерна', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/'}, 'headline': 'Египет стал главным покупателем украинского зерна', 'datePublished': '2025-10-24 15:10:00', 'dateModified': '2025-10-24 15:10:00', 'image': {'@type': 'ImageObject', 'url': '?q=90&w=1920', 'height': 1200, 'width': 1200}, 'author': {'@type': 'Organization', 'name': 'Экономическая правда', 'alternateName': 'Экономическая правда'}, 'description': 'Египет стал основным импортером украинского зерна в октябре.', 'publisher': {'type': 'Organization', 'name': 'Экономическая правда', 'logo': {'@type': 'ImageObject', 'url': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'image': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'width': 100, 'height': 100}}}, 'BreadcrumbList': {'@context': 'http://schema.org', '@type': 'BreadcrumbList', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': '/', 'name': 'Экономическая правда'}}, {'@type': 'ListItem', 'position': 2, 'item': {'@id': 'Бизнес', 'name': 'https://epravda.com.ua/rus/biznes/'}}, {'@type': 'ListItem', 'position': 3, 'item': {'@id': 'https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/', 'name': 'Египет стал главным покупателем украинского зерна'}}]}, 'ProfilePage': {'@context': 'https://schema.org', '@type': 'ProfilePage', 'mainEntity': {'@type': 'Person', 'identifier': 5383, 'image': 'https://img.epravda.com.ua/epravda/journalist/images/doc/f/0/46204/f0552447071e174c53fb0ccc8e3e9693.jpeg', 'description': 'редактор стрічки новин', 'name': 'Андрій Муравський'}}, '_LinkedDataMapping__xml': None}

Ukrainian article `self.precomputed.ld.__dict__`

{'NewsArticle': {'@context': 'http://schema.org', 'name': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://epravda.com.ua/energetika/skilki-rozpodilenoji-generaciji-derzhkompaniji-zapustyat-do-kincya-2025-roku-813310/'}, 'headline': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року', 'datePublished': '2025-10-24 14:50:00', 'dateModified': '2025-10-24 14:50:00', 'image': {'@type': 'ImageObject', 'url': 'https://img.epravda.com.ua/epravda/images/doc/2/f/55588/2fb1bb061339220fc50832e157b70fb2.jpeg?q=90&w=1920', 'height': 672.1649484536083, 'width': 1200}, 'author': {'@type': 'Organization', 'name': 'Економічна правда', 'alternateName': 'Економічна правда'}, 'description': 'До кінця 2025 року державні компанії планують встановити ще 400 МВт розподіленої газової генерації.', 'publisher': {'type': 'Organization', 'name': 'Економічна правда', 'logo': {'@type': 'ImageObject', 'url': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'image': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'width': 100, 'height': 100}}}, 'BreadcrumbList': {'@context': 'http://schema.org', '@type': 'BreadcrumbList', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': '/', 'name': 'Економічна правда'}}, {'@type': 'ListItem', 'position': 2, 'item': {'@id': 'Енергетика', 'name': 'https://epravda.com.ua/energetika/'}}, {'@type': 'ListItem', 'position': 3, 'item': {'@id': 'https://epravda.com.ua/energetika/skilki-rozpodilenoji-generaciji-derzhkompaniji-zapustyat-do-kincya-2025-roku-813310/', 'name': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року'}}]}, 'ProfilePage': {'@context': 'https://schema.org', '@type': 'ProfilePage', 'mainEntity': {'@type': 'Person', 'identifier': 2172, 'image': 
'https://img.epravda.com.ua/epravda/journalist/images/doc/0/4/2250/041f5fe-victor-volokhita-160.jpg', 'description': 'Редактор новин "Економічної правди"\r\n<br>\r\n<br>В ЕП з травня 2024 року. До цього останні 10 років працював у виданні "Наші гроші".', 'name': 'Віктор Волокіта'}}, '_LinkedDataMapping__xml': None}

Both articles appear to list Андрій Муравський as the article author on the website. In the Russian version, I can simply retrieve the author with self.precomputed.ld.xpath_search('ProfilePage/mainEntity/name'). In the Ukrainian version, the name doesn't appear in the ld at all; instead, Віктор Волокіта is listed in the same place. Having written all this out, I suspect it might just be an error on the news site's side(?) :/
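
To make the mismatch concrete, here is a plain-Python sketch that uses trimmed copies of the two dumps above. The helpers `lookup` and `person_author` are hypothetical stand-ins for the `xpath_search` call and are not part of Fundus. It shows that `NewsArticle/author` is an organization in both dumps, so only `ProfilePage` yields a person, and that the two dumps disagree on who that person is.

```python
from typing import Any, Optional


def lookup(data: dict, path: str) -> Optional[Any]:
    """Walk a nested dict along a '/'-separated path; return None if a key is missing."""
    node: Any = data
    for key in path.split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def person_author(ld: dict) -> Optional[str]:
    """Return the ProfilePage name, but only if the entity is typed as a Person."""
    if lookup(ld, "ProfilePage/mainEntity/@type") != "Person":
        return None
    return lookup(ld, "ProfilePage/mainEntity/name")


# Trimmed copies of the two dumps above (only the keys used here).
ld_ru = {
    "NewsArticle": {"author": {"@type": "Organization", "name": "Экономическая правда"}},
    "ProfilePage": {"mainEntity": {"@type": "Person", "name": "Андрій Муравський"}},
}
ld_uk = {
    "NewsArticle": {"author": {"@type": "Organization", "name": "Економічна правда"}},
    "ProfilePage": {"mainEntity": {"@type": "Person", "name": "Віктор Волокіта"}},
}

for label, ld in (("ru", ld_ru), ("uk", ld_uk)):
    print(label, lookup(ld, "NewsArticle/author/@type"), person_author(ld))
```

Since the second name differs from the byline shown on the page, the ld value alone cannot tell me which author is correct.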

However, this seems to be mostly-ish working for now.
More testing pointed me to news articles which break the parser entirely, e.g. https://www.pravda.com.ua/news/2025/10/24/8004300/. This link is listed in the newsmap and looks perfectly normal, but it actually redirects to another site, eurointegration.com.ua, which has a different layout. Can I filter these out somehow, even though their URLs appear normal in the newsmap?

On another note, none of the news articles I looked at had clear subheadings. Some do have <bold> lines. However, as in this article, some bold lines are not really subheadings.
So for now I have not set a subheadings selector. The articles do have a summary, but I'm stumped on how to extract it, as it is stored in an attribute rather than as the text content of an element.
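
For reference, this is the kind of attribute extraction I mean, sketched with the standard library only. The markup is an assumption: I use a `<meta name="description" content="...">` element as a placeholder for wherever the summary actually lives, and `MetaContentParser` and `extract_meta_content` are hypothetical names, not Fundus API.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaContentParser(HTMLParser):
    """Collect the `content` attribute of the first <meta> tag with a given name."""

    def __init__(self, name: str) -> None:
        super().__init__()
        self.name = name
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Only look at <meta> tags, and stop after the first match.
        if tag != "meta" or self.content is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self.name:
            self.content = attributes.get("content")


def extract_meta_content(html: str, name: str) -> Optional[str]:
    """Return the attribute value, or None if no matching tag exists."""
    parser = MetaContentParser(name)
    parser.feed(html)
    return parser.content


# Placeholder markup; the real page structure may differ.
sample_html = (
    '<html><head><meta name="description" content="Example summary text.">'
    "</head><body><p>Body</p></body></html>"
)
summary = extract_meta_content(sample_html, "description")
print(summary)
```

The value comes from the attribute dictionary of the start tag, not from any text node, which is why a selector that only returns element text does not find it.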

When I run pytest, it throws a bunch of errors. Perhaps because I'm on a newer Python version (3.13) than the project.

ERROR tests/test_filter.py - DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Con...

This prevents the running of any tests, as these count as errors during the collection of the tests.

mypy src also runs into errors in unrelated files:

src/fundus/parser/utility.py:505: error: List item 9 has incompatible type "tuple[str, str, str, str]"; expected "tuple[str, str] | tuple[str, str, str]"  [list-item]
src/fundus/parser/utility.py:507: error: List item 11 has incompatible type "tuple[str, str, str, str]"; expected "tuple[str, str] | tuple[str, str, str]"  [list-item]

(I haven't added the sitemap yet, will do)

MaxDall · 2025-10-29T13:19:24Z

> For instance, the two most recent articles are the same one in Ukrainian and Russian.

I’d rely on the most trustworthy source, in this case, the author listed directly on the page. You can extract it using a CSS selector such as span.post_news_author.

Additionally, I’ve noticed that many problematic articles come from a different (though similar) domain: https://epravda.com.ua/ instead of the original https://www.pravda.com.ua/. This likely stems from the alternative articles. If that’s the case, focus only on those from https://www.pravda.com.ua/.

> This link is listed in the newsmap and looks normal, but it actually redirects to another site, eurointegration.com.ua, which has a different layout. Can I filter these somehow, even though their URLs appear normal in the newsmap?

Yes, you can use the url_filter parameter in Publisher, for example:

DieWelt = Publisher(
    name="Die Welt",
    ...
    url_filter=regex_filter("/Anlegertipps-|/videos?[0-9]{2}|/mediathek/"),
)

This filters out URLs containing specific substrings (e.g., Anlegertipps).
Since Fundus filters work inversely to Python’s built-in filtering logic, you can use the Fundus inverse function to allow URLs based on a substring rather than exclude them:

from fundus.scraping.filter import inverse
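
As a rough illustration of the inversion idea, here is a self-contained sketch in plain Python. It assumes a filter is a callable that returns True for URLs that should be dropped. `drop_if_matches` and `invert` are hypothetical stand-ins written for this example, not the Fundus functions themselves.

```python
import re
from typing import Callable

# Assumed convention: a filter returns True when the URL should be dropped.
URLFilter = Callable[[str], bool]


def drop_if_matches(pattern: str) -> URLFilter:
    """Build a filter that drops every URL matching the regex."""
    compiled = re.compile(pattern)
    return lambda url: bool(compiled.search(url))


def invert(url_filter: URLFilter) -> URLFilter:
    """Flip a filter so it drops exactly the URLs the original would keep."""
    return lambda url: not url_filter(url)


# Drop everything that does NOT start with the main domain.
only_main_domain = invert(drop_if_matches(r"^https://www\.pravda\.com\.ua/"))

urls = [
    "https://www.pravda.com.ua/news/2025/10/24/8004300/",
    "https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/",
]
kept = [url for url in urls if not only_main_domain(url)]
print(kept)
```

Note that this sketch only inspects the URL string it is given, so a link that redirects elsewhere is judged by whichever URL is passed in.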

> When I run pytest, it throws a bunch of errors. Perhaps because I'm on a newer Python version (3.13) than the project.

Thanks for pointing that out! It seems Fundus currently has compatibility issues with Python versions above 3.12. I’ll open an issue to investigate further, but for now, I recommend using an older Python version.

> mypy src also runs into errors in unrelated files:

That was a known issue on our end (#806) and should be resolved once you merge the latest master branch into your branch.

bucheben · 2025-10-29T16:27:40Z

With these changes, I'm now happy with the state of the publisher.

bucheben · 2025-10-30T10:47:07Z

Hm I messed up the history somehow

bucheben · 2025-10-30T10:52:14Z

fixed it :)

MaxDall

@bucheben Thanks for addressing all my suggested changes. We're almost there. Besides the two comments I left, would you mind adding image_extraction as well?

MaxDall · 2025-11-04T15:06:18Z

tests/resources/parser/test_data/ua/meta.info

+  "Pravda_2025_10_24.html.gz": {
+    "url": "https://epravda.com.ua/tehnologiji/yak-v-ukrajini-diyatimut-limiti-dlya-gravciv-azartnih-igor-813313",
+    "crawl_date": "2025-10-24 15:15:05.296765"


The test article seems to be from a different publisher. Could you add a more suitable test case?

MaxDall · 2025-11-04T15:08:20Z

src/fundus/publishers/ua/pravda.py

+
+class PravdaParser(ParserProxy):
+    class V1(BaseParser):
+        _paragraph_selector = CSSSelector("div.post_news_text > p")


This publisher also includes sub-headlines as seen in this article. Could you add a selector for that as well?

MaxDall self-assigned this Oct 24, 2025

bucheben force-pushed the pr-pravda branch from d702ef2 to 55797ab Compare October 24, 2025 13:27

bucheben force-pushed the pr-pravda branch from c1ddb18 to 40203d7 Compare October 29, 2025 16:25

bucheben force-pushed the pr-pravda branch from 40203d7 to 7b6748e Compare October 30, 2025 10:44

add publisher Pravda

c5057b8

bucheben force-pushed the pr-pravda branch from 7b6748e to c5057b8 Compare October 30, 2025 10:52

MaxDall requested changes Nov 4, 2025
