Open
57 changes: 57 additions & 0 deletions web_programming/crawl_hindustan_times_and_get_top_news.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
"""
Fetch all the top headlines from Hindustan Times News website with title, link to the news article

Check failure on line 2 in web_programming/crawl_hindustan_times_and_get_top_news.py

GitHub Actions / ruff

Ruff (E501)

web_programming/crawl_hindustan_times_and_get_top_news.py:2:89: E501 Line too long (99 > 88)

Check failure on line 2 in web_programming/crawl_hindustan_times_and_get_top_news.py

GitHub Actions / ruff

Ruff (W291)

web_programming/crawl_hindustan_times_and_get_top_news.py:2:99: W291 Trailing whitespace
and cover image link.

The following format is used while displaying the data

news = {
    0: {
        "title": <title-of-the-article>,
        "link": <link-to-the-news-article>,
        "img": <link-to-the-cover-image>
    }
}
"""

import requests
from bs4 import BeautifulSoup


def fetch_ht_news():

Please provide return type hint for the function: fetch_ht_news. If the function does not return a value, please provide the type hint as: def function() -> None:


    header = {
        "Accept": "image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8",
        "Sec-GPC": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/127.0.0.0 Safari/537.36",

Check failure on line 25 in web_programming/crawl_hindustan_times_and_get_top_news.py

GitHub Actions / ruff

Ruff (E501)

web_programming/crawl_hindustan_times_and_get_top_news.py:25:89: E501 Line too long (136 > 88)
        "sec-ch-ua": '"Not)A;Brand";v="99", "Brave";v="127", "Chromium";v="127"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": "Windows",
    }

    url = "https://www.hindustantimes.com/"
    page_request = requests.get(url, headers=header)

Check failure on line 32 in web_programming/crawl_hindustan_times_and_get_top_news.py

GitHub Actions / ruff

Ruff (S113)

web_programming/crawl_hindustan_times_and_get_top_news.py:32:20: S113 Probable use of `requests` call without timeout
    data = page_request.content
    soup = BeautifulSoup(data, "html.parser")

    news = {}

    counter = 0

    for divtag in soup.find_all("div", {"class": "timeAgo"}):
        if "liveStory" not in divtag["class"]:
            head = divtag.find(class_="hdg3")
            title = head.get_text()
            link = divtag["data-weburl"]
            imgtag = divtag.find("img")
            try:
                img = imgtag["data-src"]
            except Exception:

Check failure on line 48 in web_programming/crawl_hindustan_times_and_get_top_news.py

GitHub Actions / ruff

Ruff (BLE001)

web_programming/crawl_hindustan_times_and_get_top_news.py:48:20: BLE001 Do not catch blind exception: `Exception`
                img = imgtag["src"]
            news[counter] = {"title": title, "link": link, "img": img}
            counter += 1

    return news


if __name__ == "__main__":
    fetch_ht_news()
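
One possible revision that addresses the annotations on this diff (E501, W291, S113, BLE001) and the request for a return type hint. This is a sketch, not the author's final code. The names `parse_ht_news`, `HT_URL` and the `timeout` parameter with its 10-second default are hypothetical, and the shortened `User-Agent` is a placeholder. The selectors (`timeAgo`, `hdg3`, `liveStory`, `data-weburl`) are taken from the submitted code and are assumed, not verified, to match the live site.

```python
from __future__ import annotations

import requests
from bs4 import BeautifulSoup

# Assumption: same URL as in the submitted code.
HT_URL = "https://www.hindustantimes.com/"


def parse_ht_news(html: str | bytes) -> dict[int, dict[str, str | None]]:
    """Extract title, article link and cover image link from the page markup."""
    soup = BeautifulSoup(html, "html.parser")
    news: dict[int, dict[str, str | None]] = {}

    for div_tag in soup.find_all("div", class_="timeAgo"):
        # Skip live-blog cards, as the submitted code does.
        if "liveStory" in div_tag.get("class", []):
            continue

        head = div_tag.find(class_="hdg3")
        link = div_tag.get("data-weburl")
        if head is None or link is None:
            # Skip cards without a headline or link instead of raising.
            continue

        # Prefer the lazy-loaded "data-src", then fall back to "src".
        # Using .get() avoids the blind `except Exception` (BLE001).
        img_tag = div_tag.find("img")
        img = None
        if img_tag is not None:
            img = img_tag.get("data-src") or img_tag.get("src")

        title = head.get_text().strip()
        news[len(news)] = {"title": title, "link": link, "img": img}

    return news


def fetch_ht_news(timeout: float = 10.0) -> dict[int, dict[str, str | None]]:
    """Download the front page and return the parsed headlines."""
    # Placeholder header; the full browser headers from the diff could be
    # kept, split across lines to stay within 88 columns (E501).
    headers = {"User-Agent": "Mozilla/5.0"}
    # An explicit timeout addresses S113.
    response = requests.get(HT_URL, headers=headers, timeout=timeout)
    response.raise_for_status()
    return parse_ht_news(response.content)
```

Splitting parsing from fetching keeps the network call out of the logic that can be tested offline: `parse_ht_news` can be run on a saved HTML snippet, while `fetch_ht_news()` is only needed for a live run. Note that the sketch differs from the submitted code in two ways: entries with no `<img>` tag get `"img": None` rather than raising, and titles are stripped of surrounding whitespace.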