
Commit 8a9853c

2 parents 2328db2 + 0171cee commit 8a9853c

2 files changed, 84 insertions(+), 0 deletions(-)
Lines changed: 27 additions & 0 deletions

# Web Scraping with Beautiful Soup

This script performs web scraping on a CodeChef problem-statement webpage using the Beautiful Soup library in Python.

## Description

The Python script uses the `requests` and `BeautifulSoup` libraries to extract information from a CodeChef problem-statement webpage. It demonstrates the following actions (a minimal sketch of the core fetch-and-parse step follows the list):

- Printing the title of the webpage.
- Finding and printing all links on the page.
- Extracting the text of all paragraphs.
- Extracting image URLs.
- Counting and categorizing HTML tags.
- Filtering and printing valid (absolute `http`/`https`) links.
- Saving the extracted data to a text file.
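A minimal sketch of the fetch-and-parse pattern the script is built around (the URL is the same one used in the script included in this commit):

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page and parse it with Python's built-in html.parser.
url = 'https://www.codechef.com/problems/TWORANGES?tab=statement'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# soup.title can be None if the page has no <title> tag.
print(soup.title.text if soup.title else 'No <title> found')
```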
## Prerequisites

Ensure you have the following libraries installed:

- `requests`
- `beautifulsoup4`

You can install them with the following command:

```bash
pip install requests beautifulsoup4
```
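To verify the installation, you can import both packages and print their versions (a quick sanity check, not part of the original script):

```python
# Sanity check: both packages expose a __version__ attribute.
import requests
import bs4

print(f"requests {requests.__version__}, beautifulsoup4 {bs4.__version__}")
```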
Lines changed: 57 additions & 0 deletions

import requests
from bs4 import BeautifulSoup
import re

url = 'https://www.codechef.com/problems/TWORANGES?tab=statement'
response = requests.get(url)
response.raise_for_status()  # Fail early on HTTP errors (4xx/5xx)
soup = BeautifulSoup(response.content, 'html.parser')

# Print the title of the webpage
print(f"Title: {soup.title.text}\n")

# Find and print all links on the page
print("Links on the page:")
for link in soup.find_all('a'):
    print(link.get('href'))

# Extract text from paragraphs
print("\nText from paragraphs:")
for paragraph in soup.find_all('p'):
    print(paragraph.text)

# Extract image URLs (skip <img> tags without a src attribute)
print("\nImage URLs:")
for img in soup.find_all('img'):
    img_url = img.get('src')
    if img_url:
        print(img_url)

# Count and categorize tags; every Tag returned by find_all() has a name
print("\nTag counts:")
tag_counts = {}
for tag in soup.find_all():
    tag_counts[tag.name] = tag_counts.get(tag.name, 0) + 1

for tag, count in tag_counts.items():
    print(f"{tag}: {count}")

# Filter and print valid links (absolute http/https URLs only)
print("\nValid links:")
for link in soup.find_all('a'):
    href = link.get('href')
    if href and re.match(r'^https?://', href):
        print(href)

# Save the extracted data to a file
with open('webpage_data.txt', 'w', encoding='utf-8') as file:
    file.write(f"Title: {soup.title.text}\n\n")
    file.write("Links on the page:\n")
    for link in soup.find_all('a'):
        file.write(f"{link.get('href')}\n")
    file.write("\nText from paragraphs:\n")
    for paragraph in soup.find_all('p'):
        file.write(f"{paragraph.text}\n")

print("\nData saved to 'webpage_data.txt'")
