Skip to content

Conversation

@bachthyaglx
Copy link

@bachthyaglx bachthyaglx commented Oct 22, 2025

Hi, I’ve added a new VN publisher (VnExpress).
Please review when you have time. Thank you!

Copy link
Collaborator

@addie9800 addie9800 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thank you so much for adding our first Vietnamese publisher! This looks quite good already. I only have a couple of remarks to simplify the code.

sources=[
RSSFeed("https://vnexpress.net/rss/tin-moi-nhat.rss"),
Sitemap("https://vnexpress.net/sitemap.xml"),
NewsMap("https://vnexpress.net/google-news-sitemap.xml"),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like they tricked you with the sitemap links, They all redirect you to home

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead you can add the other RSSFeeds from here: https://vnexpress.net/rss as sources


@attribute
def title(self) -> Optional[str]:
title_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/headline")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use scalar=True here, and it will not return a List. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/headline") to be unreliable? Usually, relying on the JSON should be sufficient.


@attribute
def authors(self) -> List[str]:
author_data_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/author")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can pass in whatever you get back directly into generic_author_parsing. It is designed to work with various inputs. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/author") to be unreliable? Usually, relying on the JSON should be sufficient.


@attribute
def publishing_date(self) -> Optional[datetime]:
date_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/datePublished")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, you can use scalar=True as well. And it should be sufficient to use the JSON value.


@attribute
def topics(self) -> List[str]:
ld_topics = self._parse_ld_keywords()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simplify this greatly by just using generic_topic_parsing(self.precomputed.meta.get("keywords"), which essentially does the same thing your custom helper methods do.



class VnExpressIntlParser(ParserProxy):
class V1(BaseParser):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The images attribute seems to be missing in this parser.

class VnExpressIntlParser(ParserProxy):
class V1(BaseParser):
_summary_selector = CSSSelector("p.description")
_paragraph_selector = CSSSelector("article.fck_detail > p")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this article, the author is also extracted from the bottom of the article.

ld_topics = self._parse_ld_keywords()
if ld_topics:
return ld_topics
return self._parse_meta_topics() No newline at end of file
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also some bloat topic like Tin nóng (= hot news), which should be removed.

class SupportsBool(Protocol):
def __bool__(self) -> bool:
...
def __bool__(self) -> bool: ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably have a different black version installed. This PR should normally not edit these files.

@addie9800 addie9800 self-assigned this Oct 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants