
Commit f43a0d2

severo, albertvillanova and AndreaFrancis authored

Move workers/datasets_based to services/worker (#800)
* feat: 🎸 use primitive parameters, add release, add tests
* style: 💄 fix style
* feat: 🎸 use primitive parameters, add release, add tests
* style: 💄 fix style
* log a warning when the migration cannot access database thanks @AndreaFrancis
* Update libs/libcommon/tests/test_resources.py
  Co-authored-by: Albert Villanova del Moral <[email protected]>
* feat: 🎸 use primitive parameters, add release, add tests
* style: 💄 fix style
* feat: 🎸 move workers/datasets_based to services/workers
* fix: 🐛 fix the Helm chart
* feat: 🎸 upgrade the minor versions of the packages and update the kenlm source
* style: 💄 fix style
* test: 💍 fix the tests if the runner is slow
* fix: 🐛 refactor to avoid having worker.py in the root
  Having worker.py at the root is not allowed since it's also the name of the package. Now:
  - WorkerLoop becomes Loop
  - Worker becomes JobRunner
  The terms are more accurate. Indeed, a JobRunner only processes one job.
* Update services/worker/pyproject.toml
  Co-authored-by: Andrea Francis Soria Jimenez <[email protected]>
* Update services/worker/src/worker/config.py
  Co-authored-by: Andrea Francis Soria Jimenez <[email protected]>
* Revert "Update services/worker/src/worker/config.py"
  This reverts commit 1bd9324.

---------

Co-authored-by: Albert Villanova del Moral <[email protected]>
Co-authored-by: Andrea Francis Soria Jimenez <[email protected]>
1 parent 67c2eee commit f43a0d2
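
To inspect the move locally, `git show --stat` summarizes the commit, and `git log --follow` tracks a file across the rename. A quick sketch (the file path is one example taken from this commit's message):

```bash
# Show the commit summary, then trace one moved file's history across
# the workers/datasets_based -> services/worker rename.
git show --stat f43a0d2
git log --follow --oneline -- services/worker/pyproject.toml
```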

102 files changed: +1509 −638 lines changed

.github/workflows/build_push_docker_hub.yml

Lines changed: 2 additions & 2 deletions
```diff
@@ -20,8 +20,8 @@ jobs:
             project: admin
           - directory: services
             project: api
-          - directory: workers
-            project: datasets_based
+          - directory: services
+            project: worker
     runs-on: "ubuntu-latest"
     steps:
       - name: Checkout repository
```
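
The matrix entry above tells CI which directory to build and which image name to produce. A hedged local equivalent, assuming the Dockerfile sits at the root of the moved package (tag and Dockerfile path are assumptions; the repository name comes from the Helm values updated later in this commit):

```bash
# Build the worker image from its new location.
docker build \
  -t huggingface/datasets-server-services-worker:dev \
  -f services/worker/Dockerfile \
  .
```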

.github/workflows/e2e.yml

Lines changed: 0 additions & 2 deletions
```diff
@@ -11,7 +11,6 @@ on:
       - 'e2e/**'
       - 'libs/**'
       - 'services/**'
-      - 'workers/**'
       - 'chart/static-files/openapi.json'
       - '.github/workflows/_e2e_tests.yml'
       - '.github/workflows/_quality-python.yml'
@@ -23,7 +22,6 @@ on:
       - 'e2e/**'
       - 'libs/**'
       - 'services/**'
-      - 'workers/**'
       - 'chart/static-files/openapi.json'
       - '.github/workflows/_e2e_tests.yml'
       - '.github/workflows/_quality-python.yml'
```
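
With the `workers/` tree gone, no workflow should reference it anymore. A quick way to double-check after applying this commit:

```bash
# List any leftover references to the removed directory in CI and chart config.
git grep -n "workers/" -- .github/workflows chart
```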
.github/workflows/s-worker.yml

Lines changed: 7 additions & 7 deletions
```diff
@@ -1,25 +1,25 @@
 # SPDX-License-Identifier: Apache-2.0
 # Copyright 2022 The HuggingFace Authors.
 
-name: workers/datasets_based
+name: services/worker
 on:
   workflow_dispatch:
   push:
     branches:
       - main
     paths:
       - 'libs/libcommon/**'
-      - 'workers/datasets_based/**'
-      - '.github/workflows/w-datasets_based.yml'
+      - 'services/worker/**'
+      - '.github/workflows/s-worker.yml'
       - '.github/workflows/_quality-python.yml'
       - '.github/workflows/_unit-tests-python.yml'
       - 'tools/docker-compose-mongo.yml'
       - 'vendors/'
   pull_request:
     paths:
       - 'libs/libcommon/**'
-      - 'workers/datasets_based/**'
-      - '.github/workflows/w-datasets_based.yml'
+      - 'services/worker/**'
+      - '.github/workflows/s-worker.yml'
       - '.github/workflows/_quality-python.yml'
       - '.github/workflows/_unit-tests-python.yml'
       - 'tools/docker-compose-mongo.yml'
@@ -28,10 +28,10 @@ jobs:
   quality:
     uses: ./.github/workflows/_quality-python.yml
     with:
-      working-directory: workers/datasets_based
+      working-directory: services/worker
       is-datasets-worker: true
   unit-tests:
     uses: ./.github/workflows/_unit-tests-python.yml
     with:
-      working-directory: workers/datasets_based
+      working-directory: services/worker
       is-datasets-worker: true
```
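
Both jobs delegate to reusable workflows with `working-directory` pointing at the moved package. A rough local equivalent of the unit-test job, assuming the package uses poetry and pytest like the rest of the monorepo (the exact steps live in the reusable workflows, not shown here):

```bash
# Install and test the worker package from its new home.
cd services/worker
poetry install
poetry run pytest -x tests/  # test directory name is an assumption
```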

.vscode/monorepo.code-workspace

Lines changed: 5 additions & 6 deletions
```diff
@@ -25,21 +25,20 @@
       "path": "../services/api"
     },
     {
-      "name": "services/reverse-proxy",
-      "path": "../services/reverse-proxy"
+      "name": "services/worker",
+      "path": "../services/worker"
     },
     {
-      "name": "workers/datasets_based",
-      "path": "../workers/datasets_based"
+      "name": "services/reverse-proxy",
+      "path": "../services/reverse-proxy"
     }
   ],
   "settings": {
     "files.exclude": {
       "e2e": true,
       "jobs": true,
       "libs": true,
-      "services": true,
-      "workers": true
+      "services": true
     },
     "python.formatting.provider": "black",
     "python.linting.enabled": true,
```

DEVELOPER_GUIDE.md

Lines changed: 6 additions & 7 deletions
````diff
@@ -28,7 +28,7 @@ make dev-start
 In development mode, you don't need to rebuild the docker images to apply a change in a worker.
 You can just restart the worker's docker container and it will apply your changes.
 
-To install a single job (in [jobs](./jobs)), library (in [libs](./libs)), service (in [services](./services)) or worker (in [workers](./workers)), go to their respective directory, and install Python 3.9 (consider [pyenv](https://github.com/pyenv/pyenv)) and [poetry](https://python-poetry.org/docs/master/#installation) (don't forget to add `poetry` to the `PATH` environment variable).
+To install a single job (in [jobs](./jobs)), library (in [libs](./libs)) or service (in [services](./services)), go to their respective directory, and install Python 3.9 (consider [pyenv](https://github.com/pyenv/pyenv)) and [poetry](https://python-poetry.org/docs/master/#installation) (don't forget to add `poetry` to the `PATH` environment variable).
 
 If you use pyenv:
 
@@ -51,20 +51,19 @@ If you use VSCode, it might be useful to use the ["monorepo" workspace](./.vscode/monorepo.code-workspace)
 
 ## Architecture
 
-The repository is structured as a monorepo, with Python libraries and applications in [jobs](./jobs)), [libs](./libs), [services](./services) and [workers](./workers):
+The repository is structured as a monorepo, with Python libraries and applications in [jobs](./jobs)), [libs](./libs) and [services](./services):
 
 - [jobs](./jobs) contains the one-time jobs run by Helm before deploying the pods. For now, the only job migrates the databases when needed.
 - [libs](./libs) contains the Python libraries used by the services and workers. For now, the only library is [libcommon](./libs/libcommon), which contains the common code for the services and workers.
-- [services](./services) contains the applications: the public API, the admin API (which is separated from the public API and might be published under its own domain at some point) and the reverse proxy.
-- [workers](./workers) contains the workers that process the queue asynchronously: they get a "job" (caution: not the Helm jobs, but the jobs stored in the queue), process the expected response for the associated endpoint, and store the response in the cache.
+- [services](./services) contains the applications: the public API, the admin API (which is separated from the public API and might be published under its own domain at some point), the reverse proxy, and the worker that processes the queue asynchronously: it gets a "job" (caution: the jobs stored in the queue, not the Helm jobs), processes the expected response for the associated endpoint, and stores the response in the cache.
 
 If you have access to the internal HF notion, see https://www.notion.so/huggingface2/Datasets-server-464848da2a984e999c540a4aa7f0ece5.
 
 The application is distributed in several components.
 
 [api](./services/api) is a web server that exposes the [API endpoints](https://huggingface.co/docs/datasets-server). Apart from some endpoints (`valid`, `is-valid`), all the responses are served from pre-computed responses. That's the main point of this project: generating these responses takes time, and the API server provides this service to the users.
 
-The precomputed responses are stored in a Mongo database called "cache". They are computed by [workers](./workers) which take their jobs from a job queue stored in a Mongo database called "queue", and store the results (error or valid response) into the "cache" (see [libcommon](./libs/libcommon)).
+The precomputed responses are stored in a Mongo database called "cache". They are computed by [workers](./services/worker) which take their jobs from a job queue stored in a Mongo database called "queue", and store the results (error or valid response) into the "cache" (see [libcommon](./libs/libcommon)).
 
 The API service exposes the `/webhook` endpoint which is called by the Hub on every creation, update or deletion of a dataset on the Hub. On deletion, the cached responses are deleted. On creation or update, a new job is appended in the "queue" database.
 
@@ -156,7 +155,7 @@ GITHUB_TOKEN=xxx
 
 ## Mac OS
 
-To install the [datasets based worker](./workers/datasets_based) on Mac OS, you can follow the next steps.
+To install the [datasets based worker](./services/worker) on Mac OS, you can follow the next steps.
 
 ### First: as an administrator
 
@@ -219,7 +218,7 @@ $ pyenv install 3.9.15
 Check that the expected local version of Python is used:
 
 ```bash
-$ cd workers/datasets_based
+$ cd services/worker
 $ python --version
 Python 3.9.15
 ```
````
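
The install flow the guide describes, condensed into one sketch (assumes pyenv and poetry are already on `PATH`; the Python version matches the guide):

```bash
# Per-package setup under the new layout.
cd services/worker
pyenv install 3.9.15   # skip if already installed
pyenv local 3.9.15     # makes `python --version` report 3.9.15 here
poetry env use 3.9.15
poetry install
```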

chart/env/dev.yaml

Lines changed: 2 additions & 3 deletions
```diff
@@ -38,11 +38,10 @@ images:
       useGlobalRegistry: false
       repository: datasets-server-services-api
       tag: sha-27ad2f7
-  workers:
-    datasetsBased:
+    worker:
       registry: huggingface
       useGlobalRegistry: false
-      repository: datasets-server-workers-datasets_based
+      repository: datasets-server-services-worker
       tag: sha-27ad2f7
 
 secrets:
```

chart/env/prod.yaml

Lines changed: 2 additions & 3 deletions
```diff
@@ -28,11 +28,10 @@ images:
       useGlobalRegistry: false
       repository: datasets-server-services-api
       tag: sha-27ad2f7
-  workers:
-    datasetsBased:
+    worker:
       registry: huggingface
       useGlobalRegistry: false
-      repository: datasets-server-workers-datasets_based
+      repository: datasets-server-services-worker
       tag: sha-27ad2f7
 
 secrets:
```

chart/templates/_helpers.tpl

Lines changed: 1 addition & 1 deletion
```diff
@@ -83,7 +83,7 @@ imagePullSecrets:
 {{- end -}}
 
 {{- define "workers.datasetsBased.image" -}}
-{{ include "datasetsServer.images.image" (dict "imageRoot" .Values.images.workers.datasetsBased "global" .Values.global.huggingface) }}
+{{ include "datasetsServer.images.image" (dict "imageRoot" .Values.images.services.worker "global" .Values.global.huggingface) }}
 {{- end -}}
 
 {{- define "image.imagePullSecrets" -}}
```

chart/values.yaml

Lines changed: 2 additions & 3 deletions
```diff
@@ -35,11 +35,10 @@ images:
       useGlobalRegistry: false
       repository: datasets-server-services-api
       tag: sha-27ad2f7
-  workers:
-    datasetsBased:
+    worker:
       registry: huggingface
       useGlobalRegistry: false
-      repository: datasets-server-workers-datasets_based
+      repository: datasets-server-services-worker
       tag: sha-27ad2f7
 
 
```
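
Since `dev.yaml`, `prod.yaml` and `values.yaml` all move the image block under `images.services.worker`, overrides at deploy time use the new key path. A hedged example (release name and chart path are assumptions):

```bash
# Override the worker image tag with the renamed values key.
helm upgrade datasets-server ./chart \
  --values chart/env/prod.yaml \
  --set images.services.worker.tag=sha-27ad2f7
```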

docs/source/server.mdx

Lines changed: 1 addition & 1 deletion
```diff
@@ -25,7 +25,7 @@ You might've noticed the `/valid` and `/is-valid` endpoints don't have a job in
 
 Workers are responsible for executing the jobs in the queue. They complete the actual preprocessing requests, such as getting a list of splits and configurations. The workers can be controlled by configurable environment variables, like the minimum or the maximum number of rows returned by a worker or the maximum number of jobs to start per dataset user or organization.
 
-Take a look at the [workers configuration](https://github.com/huggingface/datasets-server/tree/main/workers/datasets_based#configuration) for a complete list of the environment variables if you're interested in learning more.
+Take a look at the [workers configuration](https://github.com/huggingface/datasets-server/tree/main/services/worker#configuration) for a complete list of the environment variables if you're interested in learning more.
 
 ## Cache
```
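
The worker reads its tuning knobs from environment variables, so configuration is injected before the process starts. A sketch only: the variable name below is a placeholder, not one of the documented keys; see the configuration link in the diff above for the real list:

```bash
# Placeholder variable name, for illustration only; real keys are listed in
# the linked configuration section.
export WORKER_MAX_JOBS_PER_NAMESPACE=2
cd services/worker && make run  # `make run` is an assumption about the entry point
```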
