Add VN publisher (VnExpress) #802
base: master
addie9800
left a comment
Hey, thank you so much for adding our first Vietnamese publisher! This looks quite good already. I only have a couple of remarks to simplify the code.
| sources=[ | ||
| RSSFeed("https://vnexpress.net/rss/tin-moi-nhat.rss"), | ||
| Sitemap("https://vnexpress.net/sitemap.xml"), | ||
| NewsMap("https://vnexpress.net/google-news-sitemap.xml"), |
It seems they tricked you with the sitemap links: they all redirect to the home page.
Instead, you can add the other RSS feeds listed at https://vnexpress.net/rss as sources.
| @attribute | ||
| def title(self) -> Optional[str]: | ||
| title_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/headline") |
You can use scalar=True here, and it will not return a List. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/headline") to be unreliable? Usually, relying on the JSON should be sufficient.
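To illustrate the suggestion, here is a rough sketch of how the attribute body shrinks. `FakeLD` is a hypothetical stand-in written only so the snippet runs on its own; it is not the library's implementation, and returning `None` when nothing matches is an assumption. The "before" function is likewise a guess at the manual list handling, since only the first line of the original method is visible in the diff.

```python
from typing import Any, List, Optional


class FakeLD:
    """Hypothetical stand-in for the parsed JSON-LD object (not the real class)."""

    def __init__(self, values: List[Any]) -> None:
        self._values = values

    def xpath_search(self, query: str, scalar: bool = False) -> Any:
        # The stand-in ignores the query and just returns its stored values.
        # Assumed behaviour: a list by default; with scalar=True a single
        # value, or None when nothing matched.
        if not scalar:
            return list(self._values)
        return self._values[0] if self._values else None


def title_before(ld: FakeLD) -> Optional[str]:
    # Manual unpacking of the list result (assumed shape of the current code).
    title_list: List[Any] = ld.xpath_search("//NewsArticle/headline")
    if title_list:
        return title_list[0]
    return None


def title_after(ld: FakeLD) -> Optional[str]:
    # With scalar=True the value can be returned directly.
    return ld.xpath_search("//NewsArticle/headline", scalar=True)
```

Both functions give the same result for a single headline, so the second form only removes the list handling.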
| @attribute | ||
| def authors(self) -> List[str]: | ||
| author_data_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/author") |
You can pass in whatever you get back directly into generic_author_parsing. It is designed to work with various inputs. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/author") to be unreliable? Usually, relying on the JSON should be sufficient.
| @attribute | ||
| def publishing_date(self) -> Optional[datetime]: | ||
| date_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/datePublished") |
Here, you can use scalar=True as well, and it should be sufficient to use the JSON value.
| @attribute | ||
| def topics(self) -> List[str]: | ||
| ld_topics = self._parse_ld_keywords() |
You can simplify this greatly by just using generic_topic_parsing(self.precomputed.meta.get("keywords")), which essentially does the same thing your custom helper methods do.
| class VnExpressIntlParser(ParserProxy): | ||
| class V1(BaseParser): |
The images attribute seems to be missing in this parser.
| class VnExpressIntlParser(ParserProxy): | ||
| class V1(BaseParser): | ||
| _summary_selector = CSSSelector("p.description") | ||
| _paragraph_selector = CSSSelector("article.fck_detail > p") |
In this article, the author is also extracted from the bottom of the article.
| ld_topics = self._parse_ld_keywords() | ||
| if ld_topics: | ||
| return ld_topics | ||
| return self._parse_meta_topics() |
There are also some bloat topics, such as Tin nóng (= hot news), which should be removed.
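One possible way to do the filtering, as a minimal sketch: `BLOAT_TOPICS` and `remove_bloat_topics` are hypothetical names introduced for illustration, and only "Tin nóng" comes from this review; any further entries would need to be confirmed against real articles.

```python
from typing import FrozenSet, Iterable, List

# Assumption: illustrative blocklist; only "Tin nóng" is taken from the review.
BLOAT_TOPICS: FrozenSet[str] = frozenset({"Tin nóng"})


def remove_bloat_topics(
    topics: Iterable[str], bloat: Iterable[str] = BLOAT_TOPICS
) -> List[str]:
    """Drop generic section labels, comparing case-insensitively."""
    blocked = {entry.strip().casefold() for entry in bloat}
    # Keep the original order and spelling of the remaining topics.
    return [topic for topic in topics if topic.strip().casefold() not in blocked]
```

The helper could be applied to whatever list the topics attribute produces, just before returning it.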
| class SupportsBool(Protocol): | ||
| def __bool__(self) -> bool: | ||
| ... | ||
| def __bool__(self) -> bool: ... |
You probably have a different black version installed. This PR should normally not edit these files.
Hi, I’ve added a new VN publisher (VnExpress).
Please review when you have time. Thank you!