@bucheben

WIP, because I require external input to complete this PR.

Ukrainska Pravda is a news site which publishes most articles in Ukrainian, English, and Russian.
The sitemap points me directly to the NewsMap page. There, the versions of the same article in different languages are grouped together and distinguished by the hreflang attribute. Does Fundus support this? Or should I simply select a single language for now?
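For illustration, here is a minimal sketch of selecting a single language from a sitemap that groups translations via hreflang. It assumes the common `xhtml:link rel="alternate"` annotation; the sample XML, the example.com URLs, and the helper name `urls_for_language` are invented for this sketch and are neither the real NewsMap nor a Fundus API.

```python
import xml.etree.ElementTree as ET
from typing import List

# Invented sample: one <url> entry with three language alternates.
SAMPLE_SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://example.com/news/1/</loc>
    <xhtml:link rel="alternate" hreflang="uk" href="https://example.com/news/1/"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://example.com/eng/news/1/"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://example.com/rus/news/1/"/>
  </url>
</urlset>
""".strip()

# Namespace prefixes used in the lookups below.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}


def urls_for_language(sitemap_xml: str, lang: str) -> List[str]:
    """Return the href of every alternate link whose hreflang equals `lang`."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NS):
        for link in url.findall("xhtml:link", NS):
            if link.get("rel") == "alternate" and link.get("hreflang") == lang:
                urls.append(link.get("href"))
    return urls


english_urls = urls_for_language(SAMPLE_SITEMAP, "en")
```

Picking one language this way keeps each article from being crawled three times; whether Fundus handles the grouping itself is the open question above.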

The primary issue for me right now is how to extract the article body, as this is the part on which the tutorial focuses the least.

This English article for instance has the entire article body (only!) in the easily accessible precomputed.ld data.
Meanwhile, this Ukrainian article spreads the article body over many <p> tags.
This Ukrainian finance article suddenly places the author somewhere completely different.

Was I unfortunate with my choice of news site, or am I missing too much domain knowledge in DOM navigation?

@MaxDall
Collaborator

MaxDall commented Oct 24, 2025

@bucheben Thanks so much for your work so far!

> Was I just unlucky with my choice of news site, or am I missing some key knowledge about DOM navigation?

Regarding the sitemap: you definitely picked a challenging one 😅, but I’m glad you did! It actually seems like a great opportunity to extend Fundus with some useful new functionality, so thank you for choosing this outlet.

For now, I’d suggest continuing to use the existing Fundus implementation, as you’ve already done, and simply include the sitemap as well.

As for the article body: after looking through a few examples, it seems that despite the different languages, the articles all share the same layout. For example, in all three cases, the paragraphs can be selected using this CSS selector:

div.post_news_text > p
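To make the selector's meaning concrete, here is a minimal standard-library sketch of what `div.post_news_text > p` matches: only `<p>` elements that are direct children of a `div` with the class `post_news_text`. The `ParagraphCollector` class, the `extract_paragraphs` helper, and the sample HTML are illustrative assumptions, not Fundus code; in the parser itself the selector would be passed to `CSSSelector`.

```python
from html.parser import HTMLParser
from typing import List

# Elements that never have a closing tag and must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParagraphCollector(HTMLParser):
    """Collect text of <p> elements that are direct children of div.post_news_text.

    Assumes well-formed markup in which every non-void tag is closed explicitly.
    """

    def __init__(self) -> None:
        super().__init__()
        self._stack = []          # (tag, classes) of all currently open elements
        self._p_depth = None      # stack depth of the matching <p>, if one is open
        self._buffer = []         # text pieces of the current matching <p>
        self.paragraphs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        parent = self._stack[-1] if self._stack else None
        self._stack.append((tag, classes))
        is_direct_child = (
            parent is not None
            and parent[0] == "div"
            and "post_news_text" in parent[1]
        )
        if tag == "p" and self._p_depth is None and is_direct_child:
            self._p_depth = len(self._stack)
            self._buffer = []

    def handle_endtag(self, tag):
        if not self._stack or self._stack[-1][0] != tag:
            return
        if self._p_depth == len(self._stack):
            # Closing the matching <p>: normalise whitespace and store the text.
            self.paragraphs.append(" ".join("".join(self._buffer).split()))
            self._p_depth = None
        self._stack.pop()

    def handle_data(self, data):
        if self._p_depth is not None:
            self._buffer.append(data)


def extract_paragraphs(html: str) -> List[str]:
    collector = ParagraphCollector()
    collector.feed(html)
    return collector.paragraphs


SAMPLE_HTML = """
<div class="post_news_text">
  <p>First <b>paragraph</b>.</p>
  <blockquote><p>Nested quote, not a direct child.</p></blockquote>
  <p>Second paragraph.</p>
</div>
<div class="other"><p>Unrelated.</p></div>
"""

paragraphs = extract_paragraphs(SAMPLE_HTML)
```

The child combinator `>` is what keeps the nested blockquote paragraph out of the result; a descendant selector (`div.post_news_text p`) would include it.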

> This Ukrainian finance article suddenly places the author somewhere completely different.

If possible, try to extract the author information from the article's metadata, for example from the ld or meta data available in the precomputed section.

Let me know if that resolves your issue or if there’s anything else I can help with!

@bucheben
Author

Thanks for the quick response!
Your CSS selector works nicely. I'm not sure what I did when I concluded that the article body was only in the ld section of that one article.

The retrieval of the author is odd, to say the least. For instance, the two most recent articles are the same one in Ukrainian and Russian.

Russian article `self.precomputed.ld.__dict__`
{'NewsArticle': {'@context': 'http://schema.org', 'name': 'Египет стал главным покупателем украинского зерна', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/'}, 'headline': 'Египет стал главным покупателем украинского зерна', 'datePublished': '2025-10-24 15:10:00', 'dateModified': '2025-10-24 15:10:00', 'image': {'@type': 'ImageObject', 'url': '?q=90&w=1920', 'height': 1200, 'width': 1200}, 'author': {'@type': 'Organization', 'name': 'Экономическая правда', 'alternateName': 'Экономическая правда'}, 'description': 'Египет стал основным импортером украинского зерна в октябре.', 'publisher': {'type': 'Organization', 'name': 'Экономическая правда', 'logo': {'@type': 'ImageObject', 'url': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'image': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'width': 100, 'height': 100}}}, 'BreadcrumbList': {'@context': 'http://schema.org', '@type': 'BreadcrumbList', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': '/', 'name': 'Экономическая правда'}}, {'@type': 'ListItem', 'position': 2, 'item': {'@id': 'Бизнес', 'name': 'https://epravda.com.ua/rus/biznes/'}}, {'@type': 'ListItem', 'position': 3, 'item': {'@id': 'https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/', 'name': 'Египет стал главным покупателем украинского зерна'}}]}, 'ProfilePage': {'@context': 'https://schema.org', '@type': 'ProfilePage', 'mainEntity': {'@type': 'Person', 'identifier': 5383, 'image': 'https://img.epravda.com.ua/epravda/journalist/images/doc/f/0/46204/f0552447071e174c53fb0ccc8e3e9693.jpeg', 'description': 'редактор стрічки новин', 'name': 'Андрій Муравський'}}, '_LinkedDataMapping__xml': None}
Ukrainian article `self.precomputed.ld.__dict__`
{'NewsArticle': {'@context': 'http://schema.org', 'name': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року', '@type': 'NewsArticle', 'mainEntityOfPage': {'@type': 'WebPage', '@id': 'https://epravda.com.ua/energetika/skilki-rozpodilenoji-generaciji-derzhkompaniji-zapustyat-do-kincya-2025-roku-813310/'}, 'headline': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року', 'datePublished': '2025-10-24 14:50:00', 'dateModified': '2025-10-24 14:50:00', 'image': {'@type': 'ImageObject', 'url': 'https://img.epravda.com.ua/epravda/images/doc/2/f/55588/2fb1bb061339220fc50832e157b70fb2.jpeg?q=90&w=1920', 'height': 672.1649484536083, 'width': 1200}, 'author': {'@type': 'Organization', 'name': 'Економічна правда', 'alternateName': 'Економічна правда'}, 'description': 'До кінця 2025 року державні компанії планують встановити ще 400 МВт розподіленої газової генерації.', 'publisher': {'type': 'Organization', 'name': 'Економічна правда', 'logo': {'@type': 'ImageObject', 'url': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'image': 'https://epravda.com.ua/epravda/i/ep_logo.svg', 'width': 100, 'height': 100}}}, 'BreadcrumbList': {'@context': 'http://schema.org', '@type': 'BreadcrumbList', 'itemListElement': [{'@type': 'ListItem', 'position': 1, 'item': {'@id': '/', 'name': 'Економічна правда'}}, {'@type': 'ListItem', 'position': 2, 'item': {'@id': 'Енергетика', 'name': 'https://epravda.com.ua/energetika/'}}, {'@type': 'ListItem', 'position': 3, 'item': {'@id': 'https://epravda.com.ua/energetika/skilki-rozpodilenoji-generaciji-derzhkompaniji-zapustyat-do-kincya-2025-roku-813310/', 'name': 'Гринчук запевняє, що держкомпанії запустять  400 МВт розподіленої генерації до кінця року'}}]}, 'ProfilePage': {'@context': 'https://schema.org', '@type': 'ProfilePage', 'mainEntity': {'@type': 'Person', 'identifier': 2172, 'image': 
'https://img.epravda.com.ua/epravda/journalist/images/doc/0/4/2250/041f5fe-victor-volokhita-160.jpg', 'description': 'Редактор новин "Економічної правди"\r\n<br>\r\n<br>В ЕП з травня 2024 року. До цього останні 10 років працював у виданні "Наші гроші".', 'name': 'Віктор Волокіта'}}, '_LinkedDataMapping__xml': None}

Both articles appear to list Андрій Муравський as the article author on the website. In the Russian version, I can simply retrieve the author with self.precomputed.ld.xpath_search('ProfilePage/mainEntity/name'). In the Ukrainian version, the name doesn't even appear in the ld; instead, Віктор Волокіта is listed in the same place. Having written all this out, I suspect this might just be an error on the news site's side(?) :/
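The structure of the two dumps can be explored with a small path lookup over plain dicts. This is only a sketch: `lookup` is a hypothetical stand-in written for this comment, not Fundus's `xpath_search`, and the dict below is a trimmed copy of the Ukrainian dump above.

```python
from typing import Any, Optional


def lookup(ld: dict, path: str) -> Optional[Any]:
    """Follow a slash-separated key path through nested dicts; None if any key is missing."""
    node: Any = ld
    for key in path.split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Trimmed excerpt of the Ukrainian article's linked data shown above.
ld = {
    "NewsArticle": {
        "@type": "NewsArticle",
        "author": {"@type": "Organization", "name": "Економічна правда"},
    },
    "ProfilePage": {
        "@type": "ProfilePage",
        "mainEntity": {"@type": "Person", "name": "Віктор Волокіта"},
    },
}

# The NewsArticle author is the outlet itself, not a person.
article_author_type = lookup(ld, "NewsArticle/author/@type")
# The only Person in the data sits under ProfilePage.
profile_name = lookup(ld, "ProfilePage/mainEntity/name")
```

In both dumps the `NewsArticle` author is an `Organization`, so the `ProfilePage` entry is the only place a person's name shows up, and it evidently does not always match the byline on the page.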

However, this seems to be mostly-ish working for now.
Further testing turned up news articles that break the parser entirely, e.g., https://www.pravda.com.ua/news/2025/10/24/8004300/. This link is listed in the newsmap and appears perfectly normal, but it actually redirects to another site, eurointegration.com.ua, which has a different layout. Can I filter these somehow, even though their URLs appear normal in the newsmap?

On another note, none of the news articles I took a look at had clear subheadings. Some do have <bold> lines. However, like in this article, there are also sometimes bold not-really-subheadings.
So for now I have not set a subheadings selector. The articles do have a summary, but I'm stumped on how to extract it, as it is not the text content of an element but stored in an attribute.
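As a rough sketch of reading a value from an attribute rather than from element text: the example below assumes the summary sits in the `content` attribute of a `<meta>` tag, which is a guess, since the actual tag and attribute on the site are not stated here. `MetaContentFinder`, `meta_content`, and the sample HTML are hypothetical names and data for illustration only.

```python
from html.parser import HTMLParser
from typing import Optional


class MetaContentFinder(HTMLParser):
    """Find the first <meta> whose name or property equals `key` and keep its content attribute."""

    def __init__(self, key: str) -> None:
        super().__init__()
        self._key = key
        self.content: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return
        attributes = dict(attrs)
        if attributes.get("name") == self._key or attributes.get("property") == self._key:
            # The value of interest lives in an attribute, not in text content.
            self.content = attributes.get("content")


def meta_content(html: str, key: str) -> Optional[str]:
    finder = MetaContentFinder(key)
    finder.feed(html)
    return finder.content


# Invented sample markup.
SAMPLE_HEAD = '<head><meta name="description" content="A short summary."></head>'

summary = meta_content(SAMPLE_HEAD, "description")
```

With a selector-based approach the same idea applies: select the element, then read the attribute instead of the text.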

When I run pytest, it throws a bunch of errors. Perhaps because I'm on a newer Python version (3.13) than the project.

ERROR tests/test_filter.py - DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Con...

This prevents any tests from running, because these count as errors during test collection.
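One plausible reading of the message above is that warnings are being escalated to errors, so a `DeprecationWarning` raised at import time aborts collection. That is an assumption about the test configuration, not something confirmed in this thread. The sketch below only shows the general mechanism with the standard `warnings` module; `legacy_call` is a made-up function.

```python
import warnings


def legacy_call() -> str:
    """Stand-in for code that touches a deprecated API."""
    warnings.warn("legacy API is deprecated", DeprecationWarning, stacklevel=2)
    return "ok"


def run_with_warnings_as_errors() -> str:
    """With the 'error' filter, the warning is raised as an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", DeprecationWarning)
        try:
            return legacy_call()
        except DeprecationWarning as exc:
            return f"error: {exc}"


def run_with_warnings_ignored() -> str:
    """With the 'ignore' filter, the same call completes normally."""
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", DeprecationWarning)
        return legacy_call()


escalated = run_with_warnings_as_errors()
ignored = run_with_warnings_ignored()
```

If this is indeed the cause, the same code would run on a Python version where the deprecated name does not yet emit a warning.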

mypy src also runs into errors in unrelated files:

src/fundus/parser/utility.py:505: error: List item 9 has incompatible type "tuple[str, str, str, str]"; expected "tuple[str, str] | tuple[str, str, str]"  [list-item]
src/fundus/parser/utility.py:507: error: List item 11 has incompatible type "tuple[str, str, str, str]"; expected "tuple[str, str] | tuple[str, str, str]"  [list-item]

(I haven't added the sitemap yet, will do)

@MaxDall
Collaborator

MaxDall commented Oct 29, 2025

> For instance, the two most recent articles are the same one in Ukrainian and Russian.

I’d rely on the most trustworthy source, in this case, the author listed directly on the page. You can extract it using a CSS selector such as span.post_news_author.

Additionally, I’ve noticed that many problematic articles come from a different (though similar) domain: https://epravda.com.ua/ instead of the original https://www.pravda.com.ua/. This likely stems from the alternative articles. If that’s the case, focus only on those from https://www.pravda.com.ua/.

> This link is listed in the newsmap and looks normal, but it actually redirects to another site, eurointegration.com.ua, which has a different layout. Can I filter these somehow, even though their URLs appear normal in the newsmap?

Yes, you can use the url_filter parameter in Publisher, for example:

DieWelt = Publisher(
    name="Die Welt",
    ...
    url_filter=regex_filter("/Anlegertipps-|/videos?[0-9]{2}|/mediathek/"),
)

This filters out URLs containing specific substrings (e.g., Anlegertipps).
Since Fundus filters work inversely to Python’s built-in filtering logic, you can use the Fundus inverse function to allow URLs based on a substring rather than exclude them:

from fundus.scraping.filter import inverse
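To illustrate the convention described above, here is a minimal standard-library sketch in which a filter returns True for URLs that should be dropped, and an inverse wrapper turns an "exclude if it matches" rule into an "allow only if it matches" rule. `regex_filter_sketch` and `inverse_sketch` are hypothetical stand-ins written for this comment, not the Fundus implementations, and the allow pattern for www.pravda.com.ua is an assumption.

```python
import re
from typing import Callable, List

# A URL filter returns True when the URL should be dropped.
URLFilter = Callable[[str], bool]


def regex_filter_sketch(pattern: str) -> URLFilter:
    """Drop every URL in which the pattern is found."""
    compiled = re.compile(pattern)

    def url_filter(url: str) -> bool:
        return bool(compiled.search(url))

    return url_filter


def inverse_sketch(url_filter: URLFilter) -> URLFilter:
    """Negate a filter: drop every URL the wrapped filter would have kept."""

    def inverted(url: str) -> bool:
        return not url_filter(url)

    return inverted


# Exclude rule, using the pattern from the Die Welt example above.
exclude_tips = regex_filter_sketch(r"/Anlegertipps-|/videos?[0-9]{2}|/mediathek/")

# Allow rule: keep only URLs on www.pravda.com.ua, drop everything else.
only_pravda = inverse_sketch(regex_filter_sketch(r"^https://www\.pravda\.com\.ua/"))


def apply_filter(urls: List[str], url_filter: URLFilter) -> List[str]:
    """Keep the URLs for which the filter returns False."""
    return [url for url in urls if not url_filter(url)]


candidates = [
    "https://www.pravda.com.ua/news/2025/10/24/8004300/",
    "https://epravda.com.ua/rus/biznes/egipet-stal-glavnym-pokupatelem-ukrainskogo-zerna-813311/",
]
kept = apply_filter(candidates, only_pravda)
```

A pattern like this only sees the URL it is given. Whether it also catches the redirect to eurointegration.com.ua depends on whether the filter is applied to the final URL after redirection, which this sketch does not address.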

> When I run pytest, it throws a bunch of errors. Perhaps because I'm on a newer Python version (3.13) than the project.

Thanks for pointing that out! It seems Fundus currently has compatibility issues with Python versions above 3.12. I’ll open an issue to investigate further, but for now, I recommend using an older Python version.

> mypy src also runs into errors in unrelated files:

That was a known issue on our end (#806) and should be resolved once you merge the latest master branch into your branch.

@bucheben
Author

With these changes I'm now happy with the state of the publisher

@bucheben
Author

Hm I messed up the history somehow

@bucheben
Author

fixed it :)

Collaborator

@MaxDall MaxDall left a comment


@bucheben Thanks for addressing all my suggested changes. We're almost there. Besides the two comments I left, would you mind adding image_extraction as well?

Comment on lines +2 to +4
"Pravda_2025_10_24.html.gz": {
    "url": "https://epravda.com.ua/tehnologiji/yak-v-ukrajini-diyatimut-limiti-dlya-gravciv-azartnih-igor-813313",
    "crawl_date": "2025-10-24 15:15:05.296765"
Collaborator


The test article seems to be from a different publisher. Could you add a more suitable test case?


class PravdaParser(ParserProxy):
    class V1(BaseParser):
        _paragraph_selector = CSSSelector("div.post_news_text > p")
Collaborator


This publisher also includes sub-headlines, as seen in this article. Could you add a selector for those as well?
