
@baurlaur

No description provided.

@baurlaur
Author

Added publisher Klasse gegen Klasse (DE)

  • Implemented parser (author, publishing_date, topics, body)
  • Added test fixtures + JSON
  • All tests passed locally

Collaborator

@MaxDall left a comment


@baurlaur Thanks a lot for the massive amount of work you put into this 🚀 Some minor things: Would you mind adding title and image_extraction as well?

Comment on lines +606 to +608
Sitemap(
"https://www.klassegegenklasse.org/wp-sitemap.xml",
),

It looks like the sitemap includes a lot of unnecessary entries. You can use the sitemap_filter parameter of Sitemap to exclude unwanted sitemap URLs. For example, sitemap_filter=inverse(regex_filter("wp-sitemap-posts-post")) will exclude any sitemap that does not contain the substring wp-sitemap-posts-post.
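To illustrate the filter semantics described above, here is a minimal, self-contained sketch. The `regex_filter` and `inverse` functions below are local stand-ins written for this example, assuming a filter returns `True` for a URL that should be dropped; they are not the library's own implementation. The sitemap URLs are illustrative and only follow the usual WordPress naming scheme.

```python
import re
from typing import Callable, List

# Assumption: a URL filter returns True when the URL should be DROPPED.
URLFilter = Callable[[str], bool]


def regex_filter(pattern: str) -> URLFilter:
    """Stand-in: drop every URL that matches `pattern`."""
    compiled = re.compile(pattern)

    def url_filter(url: str) -> bool:
        return bool(compiled.search(url))

    return url_filter


def inverse(url_filter: URLFilter) -> URLFilter:
    """Stand-in: negate a filter, so it drops what the original kept."""

    def inverted(url: str) -> bool:
        return not url_filter(url)

    return inverted


# Drops every sitemap URL that does NOT contain "wp-sitemap-posts-post".
only_post_sitemaps = inverse(regex_filter("wp-sitemap-posts-post"))

# Illustrative sitemap URLs (hypothetical, not taken from the real index).
sitemaps: List[str] = [
    "https://www.klassegegenklasse.org/wp-sitemap-posts-post-1.xml",
    "https://www.klassegegenklasse.org/wp-sitemap-posts-page-1.xml",
    "https://www.klassegegenklasse.org/wp-sitemap-taxonomies-category-1.xml",
]

kept = [url for url in sitemaps if not only_post_sitemaps(url)]
```

With these stand-ins, only the post sitemap remains in `kept`; the page and taxonomy sitemaps are filtered out. In the publisher spec the same expression would be passed as the `sitemap_filter` argument of `Sitemap`, as suggested in the comment.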

Comment on lines +43 to +50
nodes = self.precomputed.doc.xpath(
"//a[@rel='author']/text()"
" | //span[contains(@class,'author')]//a/text()"
" | //div[contains(@class,'author')]//a/text()"
" | //div[contains(@class,'byline')]//a/text()"
" | //span[contains(@class,'byline')]//a/text()"
" | //a[contains(@href,'/autor/') or contains(@href,'/author/')]/text()"
)

Clicking through some of the articles, it seems to me that the author can usually be found in the same place. Could you elaborate on your reasoning for adding all the extra selectors?

Comment on lines +59 to +60
@attribute
def publishing_date(self) -> Optional[datetime]:

Same as with the authors: the publishing date seems to be displayed at the same location in all the articles I have encountered so far. Could you elaborate on why you added the extra fallbacks?

@MaxDall MaxDall self-assigned this Oct 24, 2025
