[WIP] Add publisher pravda (Ukrainska Pravda) #807
Conversation
|
@bucheben Thanks so much for your work so far!
Regarding the sitemap: you definitely picked a challenging one 😅, but I’m glad you did! It actually seems like a great opportunity to extend Fundus with some useful new functionality, so thank you for choosing this outlet. For now, I’d suggest continuing to use the existing Fundus implementation, as you’ve already done, and simply include the sitemap as well.
As for the article body: after looking through a few examples, it seems that despite the different languages, the articles all share the same layout. For example, in all three cases, the paragraphs can be selected using this CSS selector: `div.post_news_text > p`
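To illustrate what the `div.post_news_text > p` selector matches, here is a minimal sketch. It does not use Fundus or lxml; it assumes a small, well-formed, made-up snippet and uses the standard library's ElementTree with the equivalent path expression. The point is that only direct `p` children of the container are selected, so paragraphs nested inside other elements are skipped.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup standing in for an article page (assumption).
SAMPLE = """
<html><body>
  <div class="post_news_text">
    <p>First paragraph.</p>
    <blockquote><p>Nested quote, not a direct child.</p></blockquote>
    <p>Second paragraph.</p>
  </div>
</body></html>
""".strip()


def direct_paragraphs(markup: str) -> list[str]:
    """Return the text of <p> elements that are direct children of the container."""
    root = ET.fromstring(markup)
    # Path equivalent of the CSS selector "div.post_news_text > p"
    # (exact class match only, which is enough for this sample).
    nodes = root.findall(".//div[@class='post_news_text']/p")
    return ["".join(node.itertext()).strip() for node in nodes]


paragraphs = direct_paragraphs(SAMPLE)
print(paragraphs)
```

Real pages are rarely well-formed XML, so an HTML-aware parser is still needed in practice; the sketch only shows the child-combinator behaviour.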
If possible, try to extract the author information from the article’s metadata, for example from the […]
Let me know if that resolves your issue or if there’s anything else I can help with! |
|
Thanks for the quick response! The retrieval of the author is odd, to say the least. For instance, the two most recent articles are the same one in Ukrainian and Russian.
Russian article `self.precomputed.ld.__dict__`
Ukrainian article `self.precomputed.ld.__dict__`
Both articles appear to list […] However, this seems to be mostly-ish working for now.
On another note, none of the news articles I took a look at had clear subheadings. Some do have […]
When I run […] This prevents the running of any tests, as these count as errors during the collection of the tests.
(I haven't added the sitemap yet, will do) |
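For readers unfamiliar with the linked-data route discussed above, here is a rough sketch of pulling author names out of a JSON-LD blob with the standard library. It is not how Fundus's `precomputed.ld` works internally; the function name `authors_from_ld` and the sample data are made up for illustration, and the field layout follows the common schema.org convention where `author` may be a string, an object, or a list of either.

```python
import json


def authors_from_ld(raw_ld: str) -> list[str]:
    """Collect author names from a JSON-LD document (hypothetical helper)."""
    data = json.loads(raw_ld)
    author = data.get("author", [])
    # Normalise the single-value forms to a list.
    if isinstance(author, (str, dict)):
        author = [author]
    names = []
    for entry in author:
        name = entry.get("name") if isinstance(entry, dict) else entry
        if isinstance(name, str) and name.strip():
            names.append(name.strip())
    return names


# Made-up sample data (assumption), not taken from the publisher's pages.
SAMPLE_LD = json.dumps(
    {
        "@type": "NewsArticle",
        "headline": "Example headline",
        "author": [
            {"@type": "Person", "name": "Jane Doe"},
            {"@type": "Organization", "name": " Example Desk "},
        ],
    }
)

print(authors_from_ld(SAMPLE_LD))
```

When the linked data lists the outlet rather than a person, a helper like this returns the outlet name, which is one reason a visible byline on the page can be the more reliable source.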
I’d rely on the most trustworthy source, in this case, the author listed directly on the page. You can extract it using a CSS selector such as […]
Additionally, I’ve noticed that many problematic articles come from a different (though similar) domain: […]
Yes, you can use the […]

    DieWelt = Publisher(
        name="Die Welt",
        ...
        url_filter=regex_filter("/Anlegertipps-|/videos?[0-9]{2}|/mediathek/"),
    )

This filters out URLs containing specific substrings (e.g., […]

    from fundus.scraping.filter import inverse
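The filtering idea above can be sketched in plain Python. The helpers `make_regex_filter` and `invert` below are hypothetical stand-ins written for this illustration, not the Fundus functions, and the sketch assumes the convention that a filter returning `True` means the URL is dropped. The URLs are invented.

```python
import re
from typing import Callable

URLFilter = Callable[[str], bool]


def make_regex_filter(pattern: str) -> URLFilter:
    """Build a filter that returns True (drop) when the pattern occurs in the URL."""
    compiled = re.compile(pattern)

    def url_filter(url: str) -> bool:
        return bool(compiled.search(url))

    return url_filter


def invert(url_filter: URLFilter) -> URLFilter:
    """Flip a filter, so that only URLs matching the original pattern are kept."""

    def inverted(url: str) -> bool:
        return not url_filter(url)

    return inverted


# Pattern copied from the Die Welt example above.
PATTERN = r"/Anlegertipps-|/videos?[0-9]{2}|/mediathek/"
drop = make_regex_filter(PATTERN)
keep_only = invert(drop)

# Invented URLs, used only to show which ones the pattern hits.
for url in (
    "https://example.org/mediathek/clip",
    "https://example.org/videos12/abc",
    "https://example.org/politik/article-1",
):
    print(url, "dropped" if drop(url) else "kept")
```

Inverting a filter is handy when it is easier to describe the URLs that should stay (for instance, a single domain or path prefix) than the ones that should go.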
Thanks for pointing that out! It seems Fundus currently has compatibility issues with Python versions above 3.12. I’ll open an issue to investigate further, but for now, I recommend using an older Python version.
That was a known issue on our end (#806) and should be resolved once you merge the latest master branch into your branch. |
|
With these changes I'm now happy with the state of the publisher |
|
Hm I messed up the history somehow |
|
fixed it :) |
MaxDall
left a comment
@bucheben Thanks for addressing all my suggested changes. We're almost there. Besides the two comments I left, would you mind adding image_extraction as well?
    "Pravda_2025_10_24.html.gz": {
      "url": "https://epravda.com.ua/tehnologiji/yak-v-ukrajini-diyatimut-limiti-dlya-gravciv-azartnih-igor-813313",
      "crawl_date": "2025-10-24 15:15:05.296765"
The test article seems to be from a different publisher. Could you add a more suitable test case?
    class PravdaParser(ParserProxy):
        class V1(BaseParser):
            _paragraph_selector = CSSSelector("div.post_news_text > p")
This publisher also includes sub-headlines as seen in this article. Could you add a selector for that as well?
WIP, because I require external input to complete this PR.
Ukrainska Pravda is a news site which publishes most articles in Ukrainian, English, and Russian.
The sitemap leads directly to the NewsMap page. There, the versions of the same article in the different languages are grouped together and differentiated with the `hreflang` tag. Does Fundus support it? Should I instead simply select a single language for now?
The primary issue for me right now is how to extract the article body, as this is the part on which the tutorial focuses the least.
This English article, for instance, has the entire article body (only!) in the easily accessible `precomputed.ld` data.
Meanwhile, this Ukrainian article spreads the article body over many tags.
This Ukrainian finance article suddenly places the author somewhere completely different.
Was I unfortunate with my choice of news site, or am I missing too much domain knowledge in DOM navigation?
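On the `hreflang` question, the following is a minimal sketch of picking a single language out of a sitemap that groups translations as alternate links. It is independent of Fundus and assumes the widely used sitemap convention of `xhtml:link` elements with `rel="alternate"`; the sitemap content, the URLs, and the helper name `urls_for_language` are invented for illustration.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Invented sitemap excerpt (assumption): each <url> groups the translations
# of one article as alternate links.
SITEMAP = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <url>
    <loc>https://news.example/ukr/1</loc>
    <xhtml:link rel="alternate" hreflang="uk" href="https://news.example/ukr/1"/>
    <xhtml:link rel="alternate" hreflang="en" href="https://news.example/eng/1"/>
    <xhtml:link rel="alternate" hreflang="ru" href="https://news.example/rus/1"/>
  </url>
  <url>
    <loc>https://news.example/ukr/2</loc>
    <xhtml:link rel="alternate" hreflang="uk" href="https://news.example/ukr/2"/>
  </url>
</urlset>
""".strip()


def urls_for_language(sitemap_xml: str, lang: str) -> list[str]:
    """Return one URL per article group for the requested hreflang value."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for url in root.findall("sm:url", NS):
        for link in url.findall("xhtml:link", NS):
            if link.get("rel") == "alternate" and link.get("hreflang") == lang:
                urls.append(link.get("href"))
                break  # at most one URL per group and language
    return urls


print(urls_for_language(SITEMAP, "en"))
print(urls_for_language(SITEMAP, "uk"))
```

Selecting one language this way avoids crawling the same article three times, at the cost of skipping articles that were never translated into the chosen language.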