Commit aa98e47

RSS Feed Document Loader Publish Date Fallback (#219)
As mentioned in issue langchain-ai/langchain#12202, the publish_date on the RSS document loader is currently unreliable because it depends on newspaper3k parsing that date from the article URL. This PR adds a fallback to the loader: if newspaper3k does not return a publish date in the article object, the loader uses the feed entry's publish date instead.
1 parent 36b1f84 commit aa98e47

1 file changed: +5 -0 lines changed
  • libs/community/langchain_community/document_loaders

libs/community/langchain_community/document_loaders/rss.py

Lines changed: 5 additions & 0 deletions
@@ -124,6 +124,11 @@ def lazy_load(self) -> Iterator[Document]:
                     )
                     article = loader.load()[0]
                     article.metadata["feed"] = url
+                    # If the publish date is not set by newspaper, try to extract it from the feed entry
+                    if article.metadata.get("publish_date") is None:
+                        from datetime import datetime
+                        publish_date = entry.get("published_parsed", None)
+                        article.metadata["publish_date"] = datetime(*publish_date[:6]) if publish_date else None
                     yield article
             except Exception as e:
                 if self.continue_on_failure:
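
The fallback in this hunk can be tried in isolation. The following is a minimal sketch, not the loader's actual code: the helper name `resolve_publish_date` is hypothetical, and plain dicts stand in for `article.metadata` and the feedparser entry. It assumes `published_parsed` is a `time.struct_time`, so the first six fields (year through second) map directly onto the `datetime` constructor.

```python
import time
from datetime import datetime
from typing import Any, Mapping, Optional


def resolve_publish_date(
    metadata: Mapping[str, Any], entry: Mapping[str, Any]
) -> Optional[datetime]:
    """Hypothetical helper: prefer the article's date, else the feed entry's."""
    existing = metadata.get("publish_date")
    if existing is not None:
        # The article parser already produced a date; keep it.
        return existing
    # Assumed shape: a time.struct_time, or missing/None if the feed has no date.
    parsed = entry.get("published_parsed")
    # struct_time fields 0..5 are year, month, day, hour, minute, second.
    return datetime(*parsed[:6]) if parsed else None


# Stand-in for a feed entry; time.gmtime(0) is the Unix epoch as a struct_time.
entry = {"published_parsed": time.gmtime(0)}
print(resolve_publish_date({"publish_date": None}, entry))  # 1970-01-01 00:00:00
```

The resulting `datetime` is naive (no `tzinfo`), so callers that compare it with timezone-aware values would need to attach a timezone themselves.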
