
Commit 84c8aa7

jfcalvo and damianpumar authored
feat: add background processing jobs (#5432)
# Description

This PR adds the following changes:

- [x] Add `rq` to help us execute background jobs.
- [x] Add a background job to update all records for a dataset when the dataset distribution strategy is updated.
- [x] Change the HuggingFace Dockerfile to install Redis and run `rq` workers inside the honcho Procfile.
- [x] Add documentation about the new `ARGILLA_REDIS_URL` environment variable.
- [x] Add a ping to Redis so the Argilla server is not started if Redis is not ready.
- [x] Change the Argilla docker compose file to include a container with Redis and `rq` workers.
- [x] Update the Argilla server `README.md` file, adding Redis as a dependency to install.
- [x] Add documentation about Redis being a new Argilla server dependency.
- [x] Add a `BACKGROUND_NUM_WORKERS` environment variable to specify the number of workers in the HF Space container.
- [ ] ~~Modify `Dockerfile` template on HF to include the environment variable~~ #5443

  ```
  # (since: v2.2.0) Uncomment the next line to specify the number of background job workers to run (default: 2).
  # ENV BACKGROUND_NUM_WORKERS=2
  ```

- [ ] Remove some `TODO` sections before merging.
- [ ] Review K8s documentation (maybe delete it?).
- [ ] If we want to persist Redis data on HF Spaces we can change our `Procfile` Redis process to the following:

  ```
  redis: /usr/bin/redis-server --dbfilename argilla-redis.rdb --dir ${ARGILLA_HOME_PATH}
  ```

- [ ] <del>Allow testing job workers synchronously (with pytest)</del> This is not working due to asyncio limitations (running an asynchronous loop inside another one; more info here: rq/rq#1986).

Closes #5431

# Benchmarks

The following timings were obtained by updating the distribution strategy of datasets with 100 and 10,000 records, using a basic and an upgraded CPU on HF Spaces, with and without persistent storage, and measuring how long the background job takes to complete:

- CPU basic: 2 vCPU, 16 GB RAM
- CPU upgrade: 8 vCPU, 32 GB RAM

* CPU basic (with persistent storage):
  * 100 records dataset: ~8 seconds.
  * 10,000 records dataset: ~9 minutes.
* CPU upgrade (with persistent storage):
  * 100 records dataset: ~5 seconds.
  * 10,000 records dataset: ~6 minutes.
* CPU basic (no persistent storage):
  * 10,000 records dataset: ~101 seconds.
* CPU upgrade (no persistent storage):
  * 10,000 records dataset: ~62 seconds.

**Type of change**

- New feature (non-breaking change which adds functionality)

**How Has This Been Tested**

- [x] Testing it on HF Spaces.

**Checklist**

- I added relevant documentation
- I followed the style guidelines of this project
- I did a self-review of my code
- I made corresponding changes to the documentation
- I confirm my changes generate no new warnings
- I have added tests that prove my fix is effective or that my feature works
- I have added relevant notes to the CHANGELOG.md file (see https://keepachangelog.com/)

---------

Co-authored-by: Damián Pumar <[email protected]>
1 parent fee1f5a commit 84c8aa7
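
The core change is that updating a dataset's distribution strategy now enqueues a background job instead of updating all records inside the request. The snippet below is a minimal sketch of the general `rq` pattern involved, not the actual Argilla implementation: the `update_dataset_records_status` function, the queue name, and the default Redis URL are assumptions for illustration.

```python
import os

from redis import Redis
from rq import Queue

# Build the queue from the ARGILLA_REDIS_URL environment variable documented
# by this PR; the localhost default is an assumption for local development.
redis_conn = Redis.from_url(os.environ.get("ARGILLA_REDIS_URL", "redis://localhost:6379/0"))
queue = Queue("default", connection=redis_conn)


def update_dataset_records_status(dataset_id: str) -> None:
    """Hypothetical job body: recompute each record's completion status
    against the dataset's new distribution strategy."""
    ...


# Called when a dataset's distribution strategy is updated: the HTTP request
# returns immediately and an rq worker picks the job up in the background.
queue.enqueue(update_dataset_records_status, "my-dataset-id")
```

This is also why the benchmarks above measure how long the background job takes to complete rather than request latency: the work happens asynchronously in the worker process.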

File tree

24 files changed: +308 -107 lines changed

.github/workflows/argilla-server.yml

Lines changed: 10 additions & 0 deletions

```diff
@@ -51,6 +51,16 @@ jobs:
         ports:
           - 5432:5432

+      redis:
+        image: redis
+        options: >-
+          --health-cmd "redis-cli ping"
+          --health-interval 10s
+          --health-timeout 5s
+          --health-retries 5
+        ports:
+          - 6379:6379
+
     env:
       HF_HUB_DISABLE_TELEMETRY: 1
```
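
The `redis-cli ping` health check above keeps the CI job from starting before Redis accepts connections; the PR description mentions the same idea server-side ("Add a ping to Redis so the Argilla server is not started if Redis is not ready"). A minimal sketch of that readiness check, assuming the `ARGILLA_REDIS_URL` variable and a simple retry loop; this is not the actual Argilla startup code:

```python
import os
import time

from redis import Redis
from redis.exceptions import ConnectionError as RedisConnectionError


def wait_for_redis(retries: int = 5, delay: float = 2.0) -> None:
    """Block until Redis answers a PING, or raise after the last retry."""
    url = os.environ.get("ARGILLA_REDIS_URL", "redis://localhost:6379/0")
    connection = Redis.from_url(url)
    for attempt in range(1, retries + 1):
        try:
            connection.ping()  # raises ConnectionError while Redis is unavailable
            return
        except RedisConnectionError:
            if attempt == retries:
                raise
            time.sleep(delay)
```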

argilla-frontend/v1/infrastructure/repositories/DatasetRepository.ts

Lines changed: 1 addition & 0 deletions

```diff
@@ -100,6 +100,7 @@ export class DatasetRepository implements IDatasetRepository {
     );

     revalidateCache(`/v1/datasets/${id}`);
+    revalidateCache(`/v1/datasets/${id}/progress`);

     return {
       when: data.updated_at,
```

argilla-server/CHANGELOG.md

Lines changed: 5 additions & 0 deletions

```diff
@@ -16,6 +16,11 @@ These are the section headers that we use:

 ## [Unreleased]()

+### Added
+
+- Added [`rq`](https://python-rq.org) library to process background jobs using [Redis](https://redis.io) as a dependency. ([#5432](https://github.com/argilla-io/argilla/pull/5432))
+- Added a new background job to update records status when a dataset distribution strategy is updated. ([#5432](https://github.com/argilla-io/argilla/pull/5432))
+
 ## [2.1.0](https://github.com/argilla-io/argilla/compare/v2.0.0...v2.1.0)

 ### Added
```

argilla-server/README.md

Lines changed: 16 additions & 0 deletions

````diff
@@ -115,6 +115,12 @@ pdm migrate
 pdm server
 ```

+### Run RQ background workers
+
+```sh
+pdm worker
+```
+
 ## CLI commands

 This section list and describe the commands offered by the `argilla_server` Python package. If you need more information about the available
@@ -271,6 +277,16 @@ The `argilla_server search-engine` group of commands offers functionality to wor

 - `python -m argilla_server search-engine reindex`: reindex all Argilla entities into search engine.

+### Background Jobs
+
+Argilla uses [RQ](https://python-rq.org) as background job manager. RQ depends on [Redis](https://redis.io) to store and retrieve information about the jobs to be processed.
+
+Once that you have correctly installed Redis on your system, you can start the RQ worker by running the following CLI command:
+
+```sh
+python -m argilla_server worker
+```
+
 ## 🫱🏾‍🫲🏼 Contribute

 To help our community with the creation of contributions, we have created our [community](https://docs.argilla.io/latest/community/) docs. Additionally, you can always [schedule a meeting](https://calendly.com/david-berenstein-huggingface/30min) with our Developer Advocacy team so they can get you up to speed.
````

argilla-server/docker/argilla-hf-spaces/Dockerfile

Lines changed: 13 additions & 6 deletions

```diff
@@ -10,24 +10,30 @@ COPY scripts/start.sh /home/argilla
 COPY Procfile /home/argilla
 COPY requirements.txt /packages/requirements.txt

-RUN apt-get update && apt-get install -y \
-  apt-transport-https \
-  gnupg \
-  wget
+RUN apt-get update && \
+  apt-get install -y apt-transport-https gnupg wget

 # Install Elasticsearch signing key
 RUN wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | gpg --dearmor -o /usr/share/keyrings/elasticsearch-keyring.gpg
-
 # Add Elasticsearch repository
 RUN echo "deb [signed-by=/usr/share/keyrings/elasticsearch-keyring.gpg] https://artifacts.elastic.co/packages/8.x/apt stable main" | tee /etc/apt/sources.list.d/elastic-8.x.list

+# Install Redis signing key
+RUN wget -qO - https://packages.redis.io/gpg | gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
+# Add Redis repository
+RUN apt-get install -y lsb-release
+RUN echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | tee /etc/apt/sources.list.d/redis.list
+
 RUN \
   # Create a directory where Argilla will store the data
   mkdir /data && \
+  apt-get update && \
   # Install Elasticsearch and configure it
-  apt-get update && apt-get install -y elasticsearch=8.8.2 && \
+  apt-get install -y elasticsearch=8.8.2 && \
   chown -R argilla:argilla /usr/share/elasticsearch /etc/elasticsearch /var/lib/elasticsearch /var/log/elasticsearch && \
   chown argilla:argilla /etc/default/elasticsearch && \
+  # Install Redis
+  apt-get install -y redis && \
   # Install image dependencies
   pip install -r /packages/requirements.txt && \
   chmod +x /home/argilla/start.sh && \
@@ -52,6 +58,7 @@ ENV ELASTIC_CONTAINER=true
 ENV ES_JAVA_OPTS="-Xms1g -Xmx1g"

 ENV ARGILLA_HOME_PATH=/data/argilla
+ENV BACKGROUND_NUM_WORKERS=2
 ENV REINDEX_DATASETS=1

 CMD ["/bin/bash", "start.sh"]
```
argilla-server/docker/argilla-hf-spaces/Procfile

Lines changed: 2 additions & 0 deletions

```diff
@@ -1,2 +1,4 @@
 elastic: /usr/share/elasticsearch/bin/elasticsearch
+redis: /usr/bin/redis-server
+worker: sleep 30; rq worker-pool --num-workers ${BACKGROUND_NUM_WORKERS}
 argilla: sleep 30; /bin/bash start_argilla_server.sh
```
argilla-server/docker/argilla-hf-spaces/requirements.txt

Lines changed: 1 addition & 0 deletions

```diff
@@ -1 +1,2 @@
 honcho
+rq ~= 1.16.2
```

argilla-server/docker/server/README.md

Lines changed: 0 additions & 1 deletion

```diff
@@ -25,4 +25,3 @@ Besides the common environment variables defined in docs, this Docker image prov
 - `API_KEY`: If provided, the owner api key. When `USERNAME` and `PASSWORD` are provided and `API_KEY` is empty, a new random value will be generated (Default: `""`).

 - `REINDEX_DATASET`: If `true` or `1`, the datasets will be reindexed in the search engine. This is needed when some search configuration changed or data must be refreshed (Default: `0`).
-
```
argilla-server/pdm.lock

Lines changed: 32 additions & 1 deletion
Some generated files are not rendered by default.

argilla-server/pyproject.toml

Lines changed: 5 additions & 2 deletions

```diff
@@ -47,13 +47,15 @@ dependencies = [
     "httpx~=0.26.0",
     "oauthlib ~= 3.2.0",
     "social-auth-core ~= 4.5.0",
+    # Background processing
+    "rq ~= 1.16.2",
     # Info status
     "psutil >= 5.8, <5.10",
     # Telemetry
     "segment-analytics-python == 2.2.0",
     # For logging, tracebacks, printing, progressbars
     "rich != 13.1.0",
-    # for CLI
+    # For CLI
     "typer >= 0.6.0, < 0.10.0", # spaCy only supports typer<0.10.0
     "packaging>=23.2",
     "psycopg2-binary>=2.9.9",
@@ -169,10 +171,11 @@ _.env_file = ".env.dev"
 cli = { cmd = "python -m argilla_server.cli" }
 server = { cmd = "uvicorn argilla_server:app --port 6900 --reload" }
 migrate = { cmd = "alembic upgrade head" }
+worker = { cmd = "python -m argilla_server worker" }
 server-dev.composite = [
   "migrate",
   "cli database users create_default",
-  "server"
+  "server",
 ]
 test = { cmd = "pytest", env_file = ".env.test" }
```
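
The new `worker` script wires `pdm worker` to `python -m argilla_server worker`. As a rough illustration of what such a command can do with rq's public API (the queue name and the absence of any Argilla-specific setup are assumptions, not the project's actual code):

```python
import os

from redis import Redis
from rq import Queue, Worker


def run_worker() -> None:
    # Connect to the Redis instance configured via ARGILLA_REDIS_URL and
    # process jobs from a single queue until the process is stopped.
    connection = Redis.from_url(os.environ.get("ARGILLA_REDIS_URL", "redis://localhost:6379/0"))
    worker = Worker([Queue("default", connection=connection)], connection=connection)
    worker.work()  # blocks, executing jobs as they are enqueued


if __name__ == "__main__":
    run_worker()
```

The HF Spaces Procfile uses `rq worker-pool --num-workers ${BACKGROUND_NUM_WORKERS}` instead, which runs several such workers in one process tree.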
