Since it's called _hydra_, it also has mythical powers embedded:
- if the remote resource is a geojson, convert it to PMTiles to offer another distribution of the data
- send crawl and analysis info to a udata instance

## 🏗️ Architecture schema
The architecture for the full workflow is the following:

The hydra crawler is one of the components of the architecture. It will check if…

## 📦 Dependencies
This project uses `libmagic`, which needs to be installed on your system, e.g.:

`brew install libmagic` on macOS, or `sudo apt-get install libmagic-dev` on Linux.

This project uses Python >=3.11 and [Poetry](https://python-poetry.org) >= 2.0.0 to manage dependencies.
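
You can quickly verify that your local toolchain matches these requirements (a trivial check, assuming both tools are on your `PATH`):

```bash
python3 --version   # should print 3.11 or newer
poetry --version    # should print 2.0.0 or newer
```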

## 🖥️ CLI
### Create database structure
Install udata-hydra dependencies and CLI.

`poetry run udata-hydra load-catalog`
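
Putting this together, a first-time local setup might look like the sketch below (it assumes the PostgreSQL containers from the docker compose section are already up; `init_db` is the command referenced later under Logging & Debugging):

```bash
# Install dependencies and the udata-hydra CLI
poetry install

# Create the database structure
poetry run udata-hydra init_db

# Load the resources catalog into the database
poetry run udata-hydra load-catalog
```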

## 🕷️ Crawler
`poetry run udata-hydra-crawl`

It will crawl (forever) the catalog according to the config set in `config.toml`.

The crawler will start with URLs never checked and then proceed with URLs whose last check is older than the `CHECK_DELAYS` interval. It will then wait until something changes (catalog or time).

There's a by-domain backoff mechanism. The crawler will wait when, for a given domain in a given batch, `BACKOFF_NB_REQ` is exceeded in a period of `BACKOFF_PERIOD` seconds. It will retry until the backoff is lifted.

If a URL matches one of the `EXCLUDED_PATTERNS`, it will never be checked.

## ⚙️ Worker
A job queuing system is used to process long-running tasks. Launch the worker with the following command:
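
The worker is a standard [rq](https://python-rq.org) worker; a minimal sketch, assuming the `udata_hydra.worker` settings module and the queue names used in the `rq empty` command below:

```bash
# Sketch: consume the three queues in priority order using the project's rq settings module
poetry run rq worker -c udata_hydra.worker high default low
```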

To empty all the queues:

`poetry run rq empty -c udata_hydra.worker low default high`

## 📊 CSV conversion to database

Converted CSV tables will be stored in the database specified via `config.DATABASE_URL_CSV`. For tests it's the same database as for the catalog. Locally, `docker compose` will launch two distinct database containers.

## 🧪 Tests
To run the tests, you need to launch the database, the test database, and the Redis broker with `docker compose -f docker-compose.yml -f docker-compose.test.yml -f docker-compose.broker.yml up -d`.

Make sure the dev dependencies are installed with `poetry install --extras dev`.

Then you can run the tests with `poetry run pytest`.
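
In short, the full sequence from a clean checkout is:

```bash
# Start the databases and the Redis broker
docker compose -f docker-compose.yml -f docker-compose.test.yml -f docker-compose.broker.yml up -d

# Install dev dependencies, then run the test suite
poetry install --extras dev
poetry run pytest
```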

To run a specific test file, you can pass the path to the file to pytest, like this: `poetry run pytest tests/test_file.py`.

To run a specific test function, you can pass the path to the file and the name of the function to pytest, like this: `poetry run pytest tests/test_api/test_api_checks.py::test_get_latest_check`.

If you would like to see print statements as they are executed, you can pass the `-s` flag to pytest (`poetry run pytest -s`). However, note that this can sometimes be difficult to parse.

### 📈 Tests coverage
Pytest automatically uses the `coverage` package to generate a coverage report, which is displayed at the end of the test run in the terminal.
The coverage is configured in the `pyproject.toml` file, in the `[tool.pytest.ini_options]` section.
You can also override the coverage report configuration when running the tests by passing some flags like `--cov-report` to pytest. See [the pytest-cov documentation](https://pytest-cov.readthedocs.io/en/latest/config.html) for more information.
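
For example, to additionally produce a browsable HTML report (a standard `pytest-cov` report format, written to `htmlcov/` by default):

```bash
poetry run pytest --cov-report=term --cov-report=html
```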

## 🔌 API
The API will need a Bearer token for each request on protected endpoints (any endpoint that isn't a `GET`).
The token is configured in the `config.toml` file as `API_KEY`, and has a default value set in the `udata_hydra/config_default.toml` file.
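
For example, calling a protected (non-`GET`) endpoint with the default development key (a sketch; the endpoint is one of those listed below, and the `resource_id` is a made-up placeholder):

```bash
# Anything that isn't a GET requires the Bearer token
curl -X DELETE "http://localhost:8000/api/resources-exceptions/f8fb4c7b-aaaa-bbbb-cccc-111111111111" \
  -H "Authorization: Bearer api_key_to_change"
```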

If you're using hydra as an external service to receive resource events from [udata](https://github.com/opendatateam/udata), you need to set the API key in its `udata.cfg` file:
```python
# Whether udata should publish the resource events
PUBLISH_ON_RESOURCE_EVENTS = True
# Where to publish the events
RESOURCES_ANALYSER_URI = "http://localhost:8000"
# The API key that hydra needs
RESOURCES_ANALYSER_API_KEY = "api_key_to_change"
```

### 🚀 Run
```bash
poetry install
poetry run adev runserver udata_hydra/app.py
```
By default, the app will listen on `localhost:8000`.
You can check the status of the app with `curl http://localhost:8000/api/health`.

### 🛣️ Routes/endpoints
The API serves the following endpoints:

*Related to checks:*
- `GET` on `/api/checks/latest?url={url}&resource_id={resource_id}` to get the latest check for a given URL and/or `resource_id`
- `GET` on `/api/checks/all?url={url}&resource_id={resource_id}` to get all checks for a given URL and/or `resource_id`
- `GET` on `/api/checks/aggregate?group_by={column}&created_at={date}` to get checks occurrences grouped by a `column` for a specific `date`
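
For instance, fetching the latest check for a resource by its URL (a plain `GET`, so no token is needed; the URL is a placeholder):

```bash
# -G with --data-urlencode keeps the query parameter properly encoded
curl -G "http://localhost:8000/api/checks/latest" \
  --data-urlencode "url=https://example.com/some-resource.csv"
```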
*Related to resources:*
- `GET` on `/api/resources/{resource_id}` to get a resource in the DB "catalog" table from its `resource_id`

> - `POST` on `/api/resource/deleted` -> use `DELETE` on `/api/resources/` instead

*Related to resources exceptions:*
- `GET` on `/api/resources-exceptions` to get the list of all resources exceptions
- `POST` on `/api/resources-exceptions` to create a new resource exception in the DB
- `PUT` on `/api/resources-exceptions/{resource_id}` to update a resource exception in the DB
- `DELETE` on `/api/resources-exceptions/{resource_id}` to delete a resource exception from the DB

- `GET` on `/api/stats` to get the crawling stats
- `GET` on `/api/health` to get the API version number and environment

You may want to use a helper such as [Bruno](https://www.usebruno.com/) to handle API calls, in which case all the endpoints are ready to use [here](https://github.com/datagouv/api-calls).

More details about some endpoints are provided below with examples, but not for all of them:

The webhook integration sends HTTP messages to `udata` when resources are analysed or checked to fill resources extras.

Regarding analysis, there is a phase called "change detection". It will try to guess if a resource has been modified based on different criteria:
- harvest modified date in catalog
- content-length and last-modified headers
- checksum comparison over time

The payload should look something like:

```
{
  …
}
```

## 🛠️ Development

### 🐳 docker compose
Multiple docker-compose files are provided:
- a minimal `docker-compose.yml` with two PostgreSQL containers (one for catalog and metadata, the other for converted CSV to database)

NB: you can launch compose from multiple files like this: `docker compose -f docker-compose.yml -f docker-compose.test.yml up`

### 📝 Logging & Debugging
The log level can be adjusted using the environment variable `LOG_LEVEL`.
For example, to set the log level to `DEBUG` when initializing the database, use `LOG_LEVEL="DEBUG" udata-hydra init_db`.

### 📋 Writing a migration
1. Add a file named `migrations/{YYYYMMDD}_{description}.sql` and write the SQL you need to perform the migration.
2. `udata-hydra migrate` will migrate the database as needed.
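
For example (both the file name and the SQL are hypothetical; only the `{YYYYMMDD}_{description}.sql` naming pattern matters):

```bash
# Hypothetical migration: add a column to the catalog table
cat > migrations/20240102_add_note_to_catalog.sql <<'SQL'
ALTER TABLE catalog ADD COLUMN note text;
SQL

# Apply any pending migrations
poetry run udata-hydra migrate
```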

## 🚀 Deployment
3 services need to be deployed for the full stack to run:
- worker

Refer to each section to learn how to launch them. The only differences from dev are:
- use `HYDRA_SETTINGS` env var to point to your custom `config.toml`
- use `HYDRA_APP_SOCKET_PATH` to configure where aiohttp should listen to a [reverse proxy connection (eg nginx)](https://docs.aiohttp.org/en/stable/deployment.html#nginx-configuration) and use `udata-hydra-app` to launch the app server
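
A sketch of what this can look like in a deployment launch script (the paths are illustrative):

```bash
# Production config for all services (illustrative path)
export HYDRA_SETTINGS=/srv/hydra/config.toml

# Unix socket for nginx to proxy to (illustrative path)
export HYDRA_APP_SOCKET_PATH=/srv/hydra/hydra.sock

# Launch the app server behind the reverse proxy
udata-hydra-app
```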

## 🤝 Contributing
Before contributing to the repository and making any PR, it is necessary to initialize the pre-commit hooks:

```bash
poetry run pre-commit install
```

If you cannot use pre-commit, it is necessary to format, lint, and sort imports with:

```bash
poetry run ruff check --fix . && poetry run ruff format .
```

### 🏷️ Releases
The release process uses [bump'X](https://github.com/datagouv/bumpx).