Move workers/datasets_based to services/worker (#800)
* feat: 🎸 use primitive parameters, add release, add tests
* style: 💄 fix style
* feat: 🎸 use primitive parameters, add release, add tests
* style: 💄 fix style
* log a warning when the migration cannot access the database
thanks @AndreaFrancis
* Update libs/libcommon/tests/test_resources.py
Co-authored-by: Albert Villanova del Moral <[email protected]>
* feat: 🎸 use primitive parameters, add release, add tests
* style: 💄 fix style
* feat: 🎸 move workers/datasets_based to services/workers
* fix: 🐛 fix the Helm chart
* feat: 🎸 upgrade the minor versions of the packages
and update the kenlm source
* style: 💄 fix style
* test: 💍 fix the tests if the runner is slow
* fix: 🐛 refactor to avoid having worker.py in the root
Having worker.py at the root is not allowed since it's also the name of
the package.
Now:
- WorkerLoop becomes Loop
- Worker becomes JobRunner
The terms are more accurate: a JobRunner processes only one job, while the Loop keeps polling the queue (a rough sketch of the two follows the commit list below).
* Update services/worker/pyproject.toml
Co-authored-by: Andrea Francis Soria Jimenez <[email protected]>
* Update services/worker/src/worker/config.py
Co-authored-by: Andrea Francis Soria Jimenez <[email protected]>
* Revert "Update services/worker/src/worker/config.py"
This reverts commit 1bd9324.
---------
Co-authored-by: Albert Villanova del Moral <[email protected]>
Co-authored-by: Andrea Francis Soria Jimenez <[email protected]>
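
To make the rename above concrete, here is a minimal, illustrative sketch of how the two abstractions relate after this PR. The method names and attributes are assumptions for the example, not the real API; the actual code lives under services/worker/src/worker/.

```python
from typing import Optional


class JobRunner:
    """Formerly `Worker`: runs exactly one job."""

    def __init__(self, job: dict) -> None:
        self.job = job

    def run(self) -> None:
        # Compute the response for the job's endpoint, then store it in the cache.
        ...


class Loop:
    """Formerly `WorkerLoop`: keeps polling the queue and delegates each job."""

    def run_forever(self) -> None:
        while True:
            job = self.pop_next_job()
            if job is not None:
                JobRunner(job).run()

    def pop_next_job(self) -> Optional[dict]:
        ...  # fetch the next waiting job from the queue, if any
```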
DEVELOPER_GUIDE.md: 6 additions & 7 deletions
@@ -28,7 +28,7 @@ make dev-start
In development mode, you don't need to rebuild the docker images to apply a change in a worker.
You can just restart the worker's docker container and it will apply your changes.
-To install a single job (in [jobs](./jobs)), library (in [libs](./libs)), service (in [services](./services)) or worker (in [workers](./workers)), go to their respective directory, and install Python 3.9 (consider [pyenv](https://github.com/pyenv/pyenv)) and [poetry](https://python-poetry.org/docs/master/#installation) (don't forget to add `poetry` to the `PATH` environment variable).
+To install a single job (in [jobs](./jobs)), library (in [libs](./libs)) or service (in [services](./services)), go to their respective directory, and install Python 3.9 (consider [pyenv](https://github.com/pyenv/pyenv)) and [poetry](https://python-poetry.org/docs/master/#installation) (don't forget to add `poetry` to the `PATH` environment variable).
If you use pyenv:
@@ -51,20 +51,19 @@ If you use VSCode, it might be useful to use the ["monorepo" workspace](./.vscod
## Architecture
-The repository is structured as a monorepo, with Python libraries and applications in [jobs](./jobs), [libs](./libs), [services](./services) and [workers](./workers):
+The repository is structured as a monorepo, with Python libraries and applications in [jobs](./jobs), [libs](./libs) and [services](./services):
- [jobs](./jobs) contains the one-time jobs run by Helm before deploying the pods. For now, the only job migrates the databases when needed.
- [libs](./libs) contains the Python libraries used by the services and workers. For now, the only library is [libcommon](./libs/libcommon), which contains the common code for the services and workers.
-- [services](./services) contains the applications: the public API, the admin API (which is separated from the public API and might be published under its own domain at some point) and the reverse proxy.
-- [workers](./workers) contains the workers that process the queue asynchronously: they get a "job" (caution: not the Helm jobs, but the jobs stored in the queue), process the expected response for the associated endpoint, and store the response in the cache.
+- [services](./services) contains the applications: the public API, the admin API (which is separated from the public API and might be published under its own domain at some point), the reverse proxy, and the worker that processes the queue asynchronously: it gets a "job" (caution: the jobs stored in the queue, not the Helm jobs), processes the expected response for the associated endpoint, and stores the response in the cache.
If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Datasets-server-464848da2a984e999c540a4aa7f0ece5.
The application is distributed in several components.
[api](./services/api) is a web server that exposes the [API endpoints](https://huggingface.co/docs/datasets-server). Apart from some endpoints (`valid`, `is-valid`), all the responses are served from pre-computed responses. That's the main point of this project: generating these responses takes time, and the API server provides this service to the users.
-The precomputed responses are stored in a Mongo database called "cache". They are computed by [workers](./workers) which take their jobs from a job queue stored in a Mongo database called "queue", and store the results (error or valid response) into the "cache" (see [libcommon](./libs/libcommon)).
+The precomputed responses are stored in a Mongo database called "cache". They are computed by [workers](./services/worker) which take their jobs from a job queue stored in a Mongo database called "queue", and store the results (error or valid response) into the "cache" (see [libcommon](./libs/libcommon)).
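
To make that queue-to-cache flow concrete, here is a minimal sketch using pymongo. The collection names, field names, and the `compute_response` helper are illustrative assumptions, not the actual libcommon schema.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
queue = client["queue"]["jobs"]        # job queue (illustrative collection name)
cache = client["cache"]["responses"]   # precomputed responses (illustrative)


def compute_response(job: dict) -> dict:
    # Hypothetical stand-in for the real processing done by the worker.
    return {"dataset": job["dataset"]}


def process_one_job() -> None:
    # Atomically claim the next waiting job so two workers never share one.
    job = queue.find_one_and_update(
        {"status": "waiting"}, {"$set": {"status": "started"}}
    )
    if job is None:
        return  # nothing to do
    try:
        result = {"http_status": 200, "content": compute_response(job)}
    except Exception as err:
        # Errors are cached too, so the API can serve them without recomputing.
        result = {"http_status": 500, "content": {"error": str(err)}}
    cache.replace_one(
        {"kind": job["type"], "dataset": job["dataset"]},
        {"kind": job["type"], "dataset": job["dataset"], **result},
        upsert=True,
    )
    queue.update_one({"_id": job["_id"]}, {"$set": {"status": "finished"}})
```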
The API service exposes the `/webhook` endpoint which is called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended in the "queue" database.
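
And a sketch of that webhook behavior, using the same illustrative collections as above; the event names and the job type are again assumptions made for the example.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
queue = client["queue"]["jobs"]        # same illustrative collections as above
cache = client["cache"]["responses"]


def handle_webhook(event: str, dataset: str) -> None:
    if event == "deleted":
        # On deletion, the cached responses for the dataset are deleted.
        cache.delete_many({"dataset": dataset})
    else:  # created or updated
        # On creation or update, a new job is appended to the "queue" database.
        queue.insert_one({"type": "splits", "dataset": dataset, "status": "waiting"})
```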
@@ -156,7 +155,7 @@ GITHUB_TOKEN=xxx
## Mac OS
-To install the [datasets based worker](./workers/datasets_based) on Mac OS, you can follow the next steps.
+To install the [datasets based worker](./services/worker) on Mac OS, you can follow the next steps.
### First: as an administrator
@@ -219,7 +218,7 @@ $ pyenv install 3.9.15
Check that the expected local version of Python is used:
docs/source/server.mdx: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ You might've noticed the `/valid` and `/is-valid` endpoints don't have a job in
Workers are responsible for executing the jobs in the queue. They complete the actual preprocessing requests, such as getting a list of splits and configurations. The workers can be controlled by configurable environment variables, like the minimum or the maximum number of rows returned by a worker or the maximum number of jobs to start per dataset user or organization.
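
As a rough illustration of that pattern, a worker could load its limits from the environment like this. The variable names and defaults below are invented for the example; see the configuration link that follows for the real list.

```python
import os
from dataclasses import dataclass


@dataclass
class WorkerConfig:
    min_number_of_rows: int
    max_number_of_rows: int
    max_jobs_per_namespace: int

    @classmethod
    def from_env(cls) -> "WorkerConfig":
        # Every limit falls back to a default when the variable is unset.
        return cls(
            min_number_of_rows=int(os.environ.get("MIN_NUMBER_OF_ROWS", "10")),
            max_number_of_rows=int(os.environ.get("MAX_NUMBER_OF_ROWS", "100")),
            max_jobs_per_namespace=int(os.environ.get("MAX_JOBS_PER_NAMESPACE", "1")),
        )


config = WorkerConfig.from_env()
```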
-Take a look at the [workers configuration](https://github.com/huggingface/datasets-server/tree/main/workers/datasets_based#configuration) for a complete list of the environment variables if you're interested in learning more.
+Take a look at the [workers configuration](https://github.com/huggingface/datasets-server/tree/main/services/worker#configuration) for a complete list of the environment variables if you're interested in learning more.