Commit f22c469

Add materials for Python Web Scraping tutorial
1 parent c7aa2c4 commit f22c469

5 files changed: 74 additions, 0 deletions

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
# A Practical Introduction to Web Scraping in Python

This repository holds the code for the Real Python [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/) tutorial.

## Dependencies

To run the examples in this repository, you need to have the dependencies installed. You should first create a virtual environment:

```console
$ python -m venv venv
$ source venv/bin/activate
```

Then, navigate into the subfolder and install the requirements with `pip`:

```console
(venv) $ python -m pip install -r requirements.txt
```

## Author

- **Philipp Acsany**, E-mail: [[email protected]]([email protected])

## License

Distributed under the MIT license. See [`LICENSE`](../LICENSE) for more information.
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
from bs4 import BeautifulSoup
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
image1, image2 = soup.find_all("img")

print(image1.name)
print(image2.name)
print(soup.title.string)
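
Note on the script above: unpacking `soup.find_all("img")` into exactly two names raises `ValueError` whenever the page has more or fewer than two images. The following is a minimal sketch of a more tolerant variant, run against an inline HTML string so it works offline. `SAMPLE_HTML` and `summarize` are hypothetical names, and the markup is an assumption rather than the real profile page.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html>
<head><title>Profile: Dionysus</title></head>
<body>
<img src="/static/dionysus.jpg" />
<img src="/static/grapes.png" />
</body>
</html>
"""


def summarize(html):
    """Return the page title and the src attribute of every <img> tag."""
    soup = BeautifulSoup(html, "html.parser")
    # soup.title is None when the document has no <title> element
    title = soup.title.string if soup.title else None
    # Iterating avoids assuming a fixed number of images
    sources = [img.get("src") for img in soup.find_all("img")]
    return title, sources


title, sources = summarize(SAMPLE_HTML)
print(title)
print(sources)
```

With this shape, a page without images yields an empty list instead of an exception.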
Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
import time
import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")

    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10)
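
Note on the script above: the request count (`4`) and the last-iteration check (`i < 3`) are two literals that must be kept in sync by hand. One possible way to tie them together is sketched below. The helper name `poll` and its parameters are hypothetical, and the fetch step is replaced by a stand-in so the sketch runs without network access.

```python
import time


def poll(fetch, times, delay, sleep=time.sleep):
    """Call fetch() `times` times, pausing `delay` seconds between calls only."""
    results = []
    for i in range(times):
        results.append(fetch())
        # Skip the pause after the final request
        if i < times - 1:
            sleep(delay)
    return results


# Stand-ins for the real request and for time.sleep, so nothing blocks here
rolls = iter(["4", "2", "6", "1"])
pauses = []
results = poll(lambda: next(rolls), times=4, delay=10, sleep=pauses.append)

print(results)
print(pauses)
```

In the real script, `fetch` would wrap the `browser.get(...)` call and the `#result` lookup, and the default `time.sleep` would be used.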
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)  # Remove HTML tags

print(title)
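
Note on the script above: `re.search` returns `None` when the page has no `<title>` element, so `match_results.group()` would raise `AttributeError`, and without `re.DOTALL` the pattern does not match a title that spans several lines. A minimal sketch that handles both cases follows. `extract_title` and `TITLE_PATTERN` are hypothetical names, and the whitespace collapsing is an added assumption, not part of the committed script.

```python
import re

# Same pattern as the script, plus DOTALL so "." also matches newlines
TITLE_PATTERN = re.compile(r"<title.*?>.*?</title.*?>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> element, or None if there is none."""
    match = TITLE_PATTERN.search(html)
    if match is None:
        return None
    # Remove the HTML tags, then collapse runs of whitespace
    text = re.sub(r"<.*?>", "", match.group(), flags=re.DOTALL)
    return " ".join(text.split())


print(extract_title("<TITLE >Profile:\n  Dionysus</title  >"))
print(extract_title("<p>no title here</p>"))
```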
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
beautifulsoup4==4.11.1
certifi==2022.9.24
charset-normalizer==2.1.1
idna==3.4
lxml==4.9.1
MechanicalSoup==1.2.0
requests==2.28.1
soupsieve==2.3.2.post1
urllib3==1.26.12