Add VN publisher (VnExpress) #802


Open

bachthyaglx wants to merge 4 commits into flairNLP:master from bachthyaglx:add-vn-publisher

bachthyaglx commented Oct 22, 2025

Hi, I’ve added a new VN publisher (VnExpress).
Please review when you have time. Thank you!

bachthyaglx and others added 4 commits

October 22, 2025 15:17


          Add VN publisher

fd72dee


          Delete test.py

4f578eb


          Edit __init__.py for vn

fe13ca3


          Merge branch 'add-vn-publisher' of https://github.com/bachthyaglx/fundus into add-vn-publisher

7fc3042

addie9800 requested changes


Collaborator

addie9800 left a comment

Hey, thank you so much for adding our first Vietnamese publisher! This looks quite good already. I only have a couple of remarks to simplify the code.

src/fundus/publishers/vn/__init__.py

    
                  sources=[

                    RSSFeed("https://vnexpress.net/rss/tin-moi-nhat.rss"),

                    Sitemap("https://vnexpress.net/sitemap.xml"),

                    NewsMap("https://vnexpress.net/google-news-sitemap.xml"),

Collaborator

addie9800 Oct 23, 2025

It seems they tricked you with the sitemap links: they all redirect to the home page.

Collaborator

addie9800 Oct 23, 2025

Instead, you can add the other RSS feeds listed at https://vnexpress.net/rss as sources.
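A quick way to spot such a source is to compare the requested URL with the URL the client ends up on after redirects. The sketch below uses only the standard library; `redirected_to_home` is a hypothetical helper, not part of fundus, and the final URL is hard-coded so the example runs offline.

```python
from urllib.parse import urlsplit


def redirected_to_home(requested_url: str, final_url: str) -> bool:
    """Return True if a request for a non-root path ended up on the site root."""
    requested = urlsplit(requested_url)
    final = urlsplit(final_url)
    # Treat an empty path and "/" as the root page.
    requested_is_root = requested.path in ("", "/")
    final_is_root = final.path in ("", "/")
    return final_is_root and not requested_is_root


# In practice final_url would be the URL reported by the HTTP client after
# following redirects; it is hard-coded here to avoid network access.
print(redirected_to_home("https://vnexpress.net/sitemap.xml", "https://vnexpress.net/"))
```

A source for which this returns True serves the home page rather than a sitemap and can be dropped from `sources`.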

src/fundus/publishers/vn/vnexpress.py

    
                  @attribute

                  def title(self) -> Optional[str]:

                    title_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/headline")

Collaborator

addie9800 Oct 23, 2025

You can use scalar=True here, and it will not return a List. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/headline") to be unreliable? Usually, relying on the JSON should be sufficient.
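To illustrate why a scalar lookup removes the list handling, here is a rough standard-library sketch. `ld_scalar` and the sample payload are hypothetical stand-ins; they do not reproduce fundus's `xpath_search`, and real pages may structure their JSON-LD differently.

```python
import json
from typing import Any, Optional

# Hypothetical JSON-LD payload, made up for illustration.
SAMPLE_LD = """
{
  "@context": "https://schema.org",
  "@type": "NewsArticle",
  "headline": "Example headline",
  "datePublished": "2025-10-22T15:17:00+07:00"
}
"""


def ld_scalar(ld_json: str, ld_type: str, key: str) -> Optional[Any]:
    """Return the value stored under `key` for the first object of type `ld_type`."""
    data = json.loads(ld_json)
    # JSON-LD may be a single object or a list of objects.
    candidates = data if isinstance(data, list) else [data]
    for obj in candidates:
        if isinstance(obj, dict) and obj.get("@type") == ld_type:
            return obj.get(key)
    return None


title = ld_scalar(SAMPLE_LD, "NewsArticle", "headline")
print(title)
```

With a single value coming back, the attribute can return it directly instead of indexing into a list and guarding against the empty case.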

src/fundus/publishers/vn/vnexpress.py

    
                  @attribute

                  def authors(self) -> List[str]:

                    author_data_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/author")

Collaborator

addie9800 Oct 23, 2025

You can pass in whatever you get back directly into generic_author_parsing. It is designed to work with various inputs. Have you observed self.precomputed.ld.xpath_search("//NewsArticle/author") to be unreliable? Usually, relying on the JSON should be sufficient.
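For intuition only, a helper that accepts "various inputs" might look roughly like the sketch below. `normalize_authors` is a hypothetical name and this is not fundus's `generic_author_parsing` implementation; it only shows why the caller does not need to pre-process the JSON-LD value.

```python
from typing import Any, List


def normalize_authors(value: Any) -> List[str]:
    """Flatten a str, dict, or list of either into a list of author names."""
    if value is None:
        return []
    if isinstance(value, str):
        name = value.strip()
        return [name] if name else []
    if isinstance(value, dict):
        # schema.org Person/Organization objects carry the name under "name".
        return normalize_authors(value.get("name"))
    if isinstance(value, list):
        names: List[str] = []
        for item in value:
            names.extend(normalize_authors(item))
        return names
    # Unknown types are ignored rather than raising.
    return []


print(normalize_authors({"@type": "Person", "name": "Author A"}))
print(normalize_authors([{"name": "Author A"}, "Author B"]))
```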

src/fundus/publishers/vn/vnexpress.py

    
                  @attribute

                  def publishing_date(self) -> Optional[datetime]:

                    date_list: List[Any] = self.precomputed.ld.xpath_search("//NewsArticle/datePublished")

Collaborator

addie9800 Oct 23, 2025

You can use scalar=True here as well, and the JSON value should be sufficient.
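Once the date arrives as a single string, turning it into a `datetime` is short. The sketch below assumes the value is an ISO 8601 timestamp with a numeric UTC offset; `parse_published` and the sample timestamp are hypothetical, and fundus's own date utilities are not shown here.

```python
from datetime import datetime
from typing import Optional


def parse_published(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp, returning None for missing or malformed input."""
    if not value:
        return None
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return None


# Made-up timestamp with a +07:00 offset, for illustration only.
published = parse_published("2025-10-22T15:17:00+07:00")
print(published)
```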

src/fundus/publishers/vn/vnexpress.py

    
                  @attribute

                  def topics(self) -> List[str]:

                    ld_topics = self._parse_ld_keywords()

Collaborator

addie9800 Oct 23, 2025

You can simplify this greatly by just using generic_topic_parsing(self.precomputed.meta.get("keywords")), which essentially does the same thing your custom helper methods do.
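As a rough picture of the work such a helper takes over, the sketch below splits a delimited keywords string into unique topics. `split_keywords` is a hypothetical name, the comma delimiter is an assumption, and this is not fundus's `generic_topic_parsing` implementation.

```python
from typing import List, Optional


def split_keywords(keywords: Optional[str], delimiter: str = ",") -> List[str]:
    """Split a delimited keywords string into unique, trimmed topics, keeping order."""
    if not keywords:
        return []
    topics: List[str] = []
    for raw in keywords.split(delimiter):
        topic = raw.strip()
        # Skip empty fragments and duplicates.
        if topic and topic not in topics:
            topics.append(topic)
    return topics


print(split_keywords("Thời sự, Giao thông, Thời sự"))
```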

src/fundus/publishers/vn/vnexpress.py

    
              class VnExpressIntlParser(ParserProxy):

                class V1(BaseParser):

Collaborator

addie9800 Oct 23, 2025

The images attribute seems to be missing in this parser.

src/fundus/publishers/vn/vnexpress.py

    
              class VnExpressIntlParser(ParserProxy):

                class V1(BaseParser):

                  _summary_selector = CSSSelector("p.description")

                  _paragraph_selector = CSSSelector("article.fck_detail > p")

Collaborator

addie9800 Oct 23, 2025

In this article, the author line at the bottom of the article is also extracted as part of the body.

src/fundus/publishers/vn/vnexpress.py

    
                    ld_topics = self._parse_ld_keywords()

                    if ld_topics:

                      return ld_topics

                    return self._parse_meta_topics()

  No newline at end of file

Collaborator

addie9800 Oct 23, 2025

There are also some bloat topics, like Tin nóng (= hot news), which should be removed.
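One possible way to drop such topics is a small blocklist applied after parsing. In this sketch `BLOAT_TOPICS` and `drop_bloat_topics` are hypothetical names; only "Tin nóng" comes from the comment above, and any further entries would need to be confirmed against real articles.

```python
import unicodedata
from typing import Iterable, List


def _key(topic: str) -> str:
    """Normalize a topic for comparison: Unicode NFC, trimmed, case-insensitive."""
    return unicodedata.normalize("NFC", topic).strip().casefold()


# Assumed blocklist; extend as further bloat topics are identified.
BLOAT_TOPICS = {_key("Tin nóng")}


def drop_bloat_topics(topics: Iterable[str]) -> List[str]:
    """Return the topics that are not on the blocklist, preserving order."""
    return [topic for topic in topics if _key(topic) not in BLOAT_TOPICS]


print(drop_bloat_topics(["Tin nóng", "Giao thông"]))
```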

src/fundus/scraping/filter.py

    
              class SupportsBool(Protocol):

                  def __bool__(self) -> bool:

                      ...

                  def __bool__(self) -> bool: ...

Collaborator

addie9800 Oct 23, 2025

You probably have a different black version installed. This PR should normally not edit these files.

addie9800 self-assigned this
