Commit e5b9b80: Initial commit

8 files changed: 1306 additions, 0 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
```
dist/
src/hys_scraper.egg-info/
```

LICENSE

Lines changed: 674 additions & 0 deletions

README.md

Lines changed: 68 additions & 0 deletions
# 'Have your Say' scraper

![Python version](https://img.shields.io/badge/python-%3E%3D3.6-blue?logo=python) [![PyPI version](https://badge.fury.io/py/hys_scraper.svg)](https://badge.fury.io/py/hys_scraper) [![GPLv3 license](https://img.shields.io/github/license/felixrech/hys_scraper)](https://github.com/felixrech/hys_scraper/blob/master/LICENSE)

A small utility to scrape the European Commission's 'Have your Say' platform ([https://ec.europa.eu/info/law/better-regulation/have-your-say](https://ec.europa.eu/info/law/better-regulation/have-your-say)). It can scrape an initiative's feedback submissions, the attachments of those submissions, and the by-country and by-category statistics.
## Installation

```bash
pip3 install hys_scraper
```

Tested to work with Python 3.9 on a Linux machine and in Google Colab notebooks.
## Getting started

To get started, you will need the publication id of the initiative you want to scrape. To find it, navigate to the initiative on 'Have your Say' and look at the URL: the number at the end is the publication id you will use in the next step. For example, for the [AI Act commission adoption initiative](https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives/12527-Artificial-intelligence-ethical-and-legal-requirements/feedback_en?p_id=24212003), the publication id would be `24212003`.
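The publication id can also be pulled out of such a URL programmatically. A minimal sketch using only the standard library (the helper name is hypothetical and not part of `hys_scraper`):

```python
from urllib.parse import parse_qs, urlparse


def publication_id_from_url(url: str) -> str:
    """Extract the 'p_id' query parameter from an initiative URL.

    Hypothetical helper for illustration, not part of hys_scraper.
    """
    return parse_qs(urlparse(url).query)["p_id"][0]


url = (
    "https://ec.europa.eu/info/law/better-regulation/have-your-say/initiatives/"
    "12527-Artificial-intelligence-ethical-and-legal-requirements/feedback_en"
    "?p_id=24212003"
)
print(publication_id_from_url(url))  # prints 24212003
```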
To scrape an initiative, the following is sufficient (replace `24212003` with the publication id of the initiative you want to scrape):

```bash
python3 -m hys_scraper 24212003
```
This will create a new folder in your current working directory with the following layout:

```
24212003_requirements_for_artificial_intelligence/
├── attachments
│   ├── 2488672.pdf
│   ├── 2596917.pdf
│   └── ...
├── attachments.csv
├── categories.csv
├── countries.csv
└── feedbacks.csv

1 directory, 263 files
```
## Advanced usage

The command line interface has a few more arguments. For example, instead of having `hys_scraper` create a folder in the current working directory to save results into, you can manually specify a target directory.

```
$ python3 -m hys_scraper -h
Scrape feedback and statistics from the European Commission's 'Have your Say' platform.

positional arguments:
  PID                   The publication id - what comes after 'p_id=' in the initiative's URL.

optional arguments:
  -h, --help            show this help message and exit
  --dir target_dir, --target_dir target_dir
                        Directory to save the feedback and statistics dataframes to. Defaults to creating a new
                        folder in the current working directory.
  --no_attachments      Whether to skip the download of attachments.
  --sleep_time t        Minimum time between consecutive HTTP requests (in seconds).
```
Alternatively, you can also access `hys_scraper` from Python:

```python
from hys_scraper import HYS_Scraper

feedbacks, countries, categories = HYS_Scraper("24212003").scrape()
```

Similar options as for the command line interface are available; check out `help(HYS_Scraper)` for details.

pyproject.toml

Lines changed: 23 additions & 0 deletions
[build-system]
requires = ["setuptools>=61.0.0", "wheel"]
build-backend = "setuptools.build_meta"

[project]
name = "hys_scraper"
version = "0.1.3"
description = "Scrape feedback and statistics from the European Commission's 'Have your Say' platform."
readme = "README.md"
classifiers = [
    "License :: OSI Approved :: GNU General Public License v3 (GPLv3)",
    "Programming Language :: Python",
    "Programming Language :: Python :: 3",
]
dependencies = [
    "requests",
    "pandas",
]
requires-python = ">=3.6"
[project.scripts]
hys_scraper = "hys_scraper.__main__:main"

requirements.txt

Lines changed: 26 additions & 0 deletions
#
# This file is autogenerated by pip-compile with python 3.9
# To update, run:
#
#    pip-compile pyproject.toml
#
certifi==2022.9.24
    # via requests
charset-normalizer==2.1.1
    # via requests
idna==3.4
    # via requests
numpy==1.23.4
    # via pandas
pandas==1.5.0
    # via hys-scraper (pyproject.toml)
python-dateutil==2.8.2
    # via pandas
pytz==2022.4
    # via pandas
requests==2.28.1
    # via hys-scraper (pyproject.toml)
six==1.16.0
    # via python-dateutil
urllib3==1.26.12
    # via requests

src/hys_scraper/__init__.py

Lines changed: 3 additions & 0 deletions
from hys_scraper.hys_scraper import HYS_Scraper

__version__ = "0.1.3"

src/hys_scraper/__main__.py

Lines changed: 56 additions & 0 deletions
import argparse

from hys_scraper import HYS_Scraper


def main():
    """Runs the 'Have your Say' scraper with command line arguments."""
    parser = argparse.ArgumentParser(
        prog="python3 -m hys_scraper",
        description="Scrape feedback and statistics from the European Commission's "
        + "'Have your Say' platform.",
    )
    parser.add_argument(
        "publication_id",
        metavar="PID",
        type=str,
        help="The publication id - what comes after 'p_id=' in the initiative's URL.",
    )
    parser.add_argument(
        "--dir",
        "--target_dir",
        metavar="target_dir",
        type=str,
        default=None,
        help="Directory to save the feedback and statistics dataframes to. "
        + "Defaults to creating a new folder in the current working directory.",
    )
    parser.add_argument(
        "--no_attachments",
        action="store_true",
        help="Whether to skip the download of attachments.",
    )
    parser.add_argument(
        "--sleep_time",
        metavar="t",
        type=int,
        default=None,
        help="Minimum time between consecutive HTTP requests (in seconds).",
    )

    # Deviate from the scraper's default values only if the user specified something
    args = parser.parse_args()
    kwargs = {}
    if args.dir is not None:
        kwargs["target_dir"] = args.dir
    if args.no_attachments:
        kwargs["download_attachments"] = False
    if args.sleep_time is not None:
        kwargs["sleep_time"] = args.sleep_time

    # Scrape using the user-set parameters
    HYS_Scraper(args.publication_id, **kwargs).scrape()


if __name__ == "__main__":
    main()
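The kwargs-filtering step in `main` (only forwarding options the user actually set, so the scraper's own defaults apply otherwise) can be exercised in isolation. A minimal sketch with a stripped-down parser, using the same option names as the code above:

```python
import argparse

# Stripped-down version of the CLI from __main__.py, for illustration only.
parser = argparse.ArgumentParser()
parser.add_argument("--dir", default=None)
parser.add_argument("--no_attachments", action="store_true")
parser.add_argument("--sleep_time", type=int, default=None)

# Simulate a user who set --no_attachments and --sleep_time but not --dir.
args = parser.parse_args(["--no_attachments", "--sleep_time", "2"])

# Only options the user explicitly set end up in kwargs.
kwargs = {}
if args.dir is not None:
    kwargs["target_dir"] = args.dir
if args.no_attachments:
    kwargs["download_attachments"] = False
if args.sleep_time is not None:
    kwargs["sleep_time"] = args.sleep_time

print(kwargs)  # prints {'download_attachments': False, 'sleep_time': 2}
```

This pattern lets `HYS_Scraper` keep a single source of truth for default values instead of duplicating them in the CLI.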
