Commit fe609dd

Merge pull request #311 from realpython/python-web-scraping-practical-introduction
Add materials for Python Web Scraping tutorial
2 parents 043e2a5 + 3607505 commit fe609dd

5 files changed: +76 -0 lines changed

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
# A Practical Introduction to Web Scraping in Python

This repository holds the code for the Real Python [A Practical Introduction to Web Scraping in Python](https://realpython.com/python-web-scraping-practical-introduction/) tutorial.

## Dependencies

To run the examples in this repository, you need to have the dependencies installed. You should first create a virtual environment:

```console
$ python -m venv venv
$ source venv/bin/activate
```

Then, navigate into the subfolder and install the requirements with `pip`:

```console
(venv) $ python -m pip install -r requirements.txt
```

## Author

- [David Amos](https://realpython.com/team/damos/)

## License

Distributed under the MIT license. See [`LICENSE`](../LICENSE) for more information.
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
from urllib.request import urlopen

from bs4 import BeautifulSoup

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
image1, image2 = soup.find_all("img")

print(image1.name)
print(image2.name)
print(soup.title.string)
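
Note on the script above: it unpacks exactly two `<img>` tags, so it raises a `ValueError` if the page ever contains more or fewer images. The sketch below shows the same kind of extraction (image tags plus the page title) without a network request. It swaps in the standard library's `html.parser` instead of Beautiful Soup, so it is not the API the script uses. `SAMPLE_HTML` and `ImageCollector` are hypothetical names, and the markup is invented for illustration rather than copied from the real profile page.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for the downloaded page; not the real markup.
SAMPLE_HTML = """
<html>
<head><title>Profile: Dionysus</title></head>
<body>
<img src="/static/dionysus.jpg">
<img src="/static/grapes.png">
</body>
</html>
"""


class ImageCollector(HTMLParser):
    """Collect the src of every <img> tag and the text of <title>."""

    def __init__(self):
        super().__init__()
        self.image_sources = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # attrs is a list of (name, value) pairs
            self.image_sources.append(dict(attrs).get("src"))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


collector = ImageCollector()
collector.feed(SAMPLE_HTML)
collector.close()  # flush any buffered text

print(collector.image_sources)
print(collector.title)
```

Collecting into a list, rather than unpacking into a fixed number of names, keeps the code working whatever the number of images on the page.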
Lines changed: 15 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,15 @@
import time

import mechanicalsoup

browser = mechanicalsoup.Browser()

for i in range(4):
    page = browser.get("http://olympus.realpython.org/dice")
    tag = page.soup.select("#result")[0]
    result = tag.text
    print(f"The result of your dice roll is: {result}")

    # Wait 10 seconds if this isn't the last request
    if i < 3:
        time.sleep(10)
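
Note on the script above: the "pause between requests, but not after the last one" loop can be tried without a network connection or real waiting. The sketch below is one possible way to factor it, assuming the fetch step and the sleep function are passed in as arguments. `poll`, `fake_rolls`, and `pauses` are hypothetical names and do not appear in the commit.

```python
import time


def poll(fetch, attempts, delay, sleep=time.sleep):
    """Call fetch() `attempts` times, pausing `delay` seconds between calls.

    No pause follows the final call, mirroring the `if i < 3` check.
    """
    results = []
    for i in range(attempts):
        results.append(fetch())
        if i < attempts - 1:
            sleep(delay)
    return results


# Hypothetical stand-ins: canned dice rolls and a sleep that only records.
fake_rolls = iter(["4", "1", "6", "2"])
pauses = []

rolls = poll(lambda: next(fake_rolls), attempts=4, delay=10, sleep=pauses.append)
for roll in rolls:
    print(f"The result of your dice roll is: {roll}")
```

Deriving the last-iteration check from `attempts` avoids keeping the `3` in the condition in sync with the `4` in `range(4)` by hand.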
Lines changed: 13 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,13 @@
import re
from urllib.request import urlopen

url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")

pattern = "<title.*?>.*?</title.*?>"
match_results = re.search(pattern, html, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)  # Remove HTML tags

print(title)
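
Note on the script above: `re.search` returns `None` when nothing matches, so `match_results.group()` raises an `AttributeError` on a page without a `<title>` element. The sketch below applies the same pattern to inline strings and guards against that case. `extract_title` and `SAMPLE_HTML` are hypothetical names, the sample markup is invented, and the `re.DOTALL` flag is an addition (so a title spanning several lines would still match) that the commit does not use.

```python
import re

# Hypothetical sample with irregular casing and stray whitespace in the tags.
SAMPLE_HTML = "<html><head><TITLE >Profile: Dionysus</title  ></head></html>"


def extract_title(html):
    """Return the text of the first <title> element, or None if absent."""
    match = re.search(
        "<title.*?>.*?</title.*?>", html, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        return None
    # Strip the opening and closing tags, then surrounding whitespace
    return re.sub("<.*?>", "", match.group()).strip()


print(extract_title(SAMPLE_HTML))
print(extract_title("<html><body>No title here</body></html>"))
```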
Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,9 @@
beautifulsoup4==4.11.1
certifi==2022.9.24
charset-normalizer==2.1.1
idna==3.4
lxml==4.9.1
MechanicalSoup==1.2.0
requests==2.28.1
soupsieve==2.3.2.post1
urllib3==1.26.12
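
Note on the pins above: after installing, one way to confirm that the active environment matches them is to compare each pin against the installed version reported by `importlib.metadata`. The sketch below assumes every line has the simple `name==version` form seen here. `parse_pins` and `installed_version` are hypothetical helpers, and the pins are inlined so the block runs on its own.

```python
from importlib import metadata

# The pins listed above, inlined so the sketch is self-contained.
REQUIREMENTS = """\
beautifulsoup4==4.11.1
certifi==2022.9.24
charset-normalizer==2.1.1
idna==3.4
lxml==4.9.1
MechanicalSoup==1.2.0
requests==2.28.1
soupsieve==2.3.2.post1
urllib3==1.26.12
"""


def parse_pins(text):
    """Map package name to pinned version for simple `name==version` lines."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, version = line.partition("==")
        pins[name] = version
    return pins


def installed_version(name):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


for name, wanted in parse_pins(REQUIREMENTS).items():
    found = installed_version(name)
    status = "ok" if found == wanted else f"expected {wanted}, found {found}"
    print(f"{name}: {status}")
```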
