Commit 9daefee

burtenshaw, pre-commit-ci[bot], frascuchon, davidberenstein1957, dvsrepo
authored
[FEAT] hub integration with dataset and configuration (#5161)
This is an experimental WIP pr to get feedback on the approach. I've migrated the code out of the v1 argilla client, and reimplemented for v2 changes.

It would be great to get feedback on issues like this:

- testing with a mocked hub api.
- dealing with responses across argilla servers and mismatched user_ids.
- dealing with dependencies like `huggingface_hub`. I like the decorator used in the v1 client.

Here's a dataset that I pushed. Still uses default readme from v1: https://huggingface.co/datasets/burtenshaw/test-argilla-dataset

To test this implementation do:

```python
import uuid
from datetime import datetime

import argilla as rg

client = rg.Argilla(api_key="owner.apikey")
workspace = client.workspaces[0]
mock_dataset_name = (
    f"test_add_record_with_suggestions {datetime.now().strftime('%Y%m%d%H%M%S')}"
)
mock_data = [
    {
        "text": "Hello World, how are you?",
        "label": "positive",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "topics": ["topic1", "topic2"],
        "topics.score": [0.9, 0.8],
    },
    {
        "text": "Hello World, how are you?",
        "label": "negative",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "topics": ["topic3"],
        "topics.score": [0.9],
    },
    {
        "text": "Hello World, how are you?",
        "label": "positive",
        "id": uuid.uuid4(),
        "comment": "I'm doing great, thank you!",
        "comment_score": 0.9,  # This field will be ignored because it is not in the mapping
        "rating": 1,
        "topics": ["topic1", "topic2", "topic3"],
        "topics.score": [0.9, 0.8, 0.7],
        "ranking": ["label1", "label2", "label3"],
        "span": [
            {
                "start": 0,
                "end": 5,
                "label": "label1",
            },
            {
                "start": 6,
                "end": 11,
                "label": "label2",
            },
            {
                "start": 12,
                "end": 17,
                "label": "label3",
            },
        ],
        "vector": [1, 2, 3],
    },
]
settings = rg.Settings(
    fields=[
        rg.TextField(name="text"),
    ],
    questions=[
        rg.LabelQuestion(name="label", labels=["positive", "negative"]),
        rg.RatingQuestion(name="rating", values=[1, 2, 3, 4, 5]),
        rg.RankingQuestion(name="ranking", values=["label1", "label2", "label3"]),
        rg.TextQuestion(name="comment", use_markdown=False),
        rg.MultiLabelQuestion(
            name="topics",
            labels=["topic1", "topic2", "topic3"],
            labels_order="suggestion",
        ),
        rg.SpanQuestion(
            name="span", labels=["label1", "label2", "label3"], field="text"
        ),
    ],
    metadata=[
        rg.FloatMetadataProperty(name="comment_score"),
    ],
    vectors=[
        rg.VectorField(name="vector", dimensions=3),
    ],
)
dataset = rg.Dataset(
    name=mock_dataset_name,
    settings=settings,
    client=client,
)
dataset.create()
dataset.records.log(
    mock_data,
    mapping={
        "comment": "comment.suggestion",
        "comment_score": "comment.suggestion.score",  # This field will be ignored because it is not in the mapping
        "topics": "topics.suggestion",
        "topics.score": "topics.suggestion.score",
        "label": "label.response",
        "span": "span",
    },
)
dataset.to_hub(repo_id="burtenshaw/test-argilla-dataset")
pulled_dataset = from_huggingface("burtenshaw/test-argilla-dataset")
```

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Paco Aranda <[email protected]>
Co-authored-by: David Berenstein <[email protected]>
Co-authored-by: Daniel Vila Suero <[email protected]>
Co-authored-by: Lucain <[email protected]>
1 parent 556b825 commit 9daefee

33 files changed

+778
-168
lines changed
Lines changed: 179 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,179 @@
1+
---
2+
description: In this section, we will provide a step-by-step guide to show how to import and export datasets to Python, local disk, or the Hugging Face Hub
3+
---
4+
5+
# Importing and exporting datasets and records
6+
7+
This guide provides an overview of how to import and export your dataset or its records to Python, your local disk, or the Hugging Face Hub.
8+
9+
In Argilla, you can import/export two main components of a dataset:
10+
- The dataset's complete configuration defined in `rg.Settings`. This is useful if you want to share your feedback task or restore it later in Argilla.
11+
- The records stored in the dataset, including `Metadata`, `Vectors`, `Suggestions`, and `Responses`. This is useful if you want to use your dataset's records outside of Argilla.
12+
13+
Check the [Dataset - Python Reference](../reference/argilla/datasets/dataset.md) to see the attributes, arguments, and methods of the `Dataset` class in detail.
14+
15+
To import records to a dataset, use the `rg.Dataset.records.log` method. There is a guide on how to do this in the [Record - Python Reference](record.md).
16+
17+
## Import and Export an `rg.Dataset` from Argilla
18+
19+
First, we will go through importing and exporting a complete dataset from Argilla. This includes the dataset's settings and records. All of these methods use the `rg.Dataset.from_*` and `rg.Dataset.to_*` methods.
20+
21+
### Push an Argilla dataset to the Hugging Face Hub
22+
23+
You can push a dataset from Argilla to the Hugging Face Hub using the `rg.Dataset.to_hub` method. This is useful if you want to share your dataset with the community or version control it.
24+
25+
```python
26+
import argilla as rg
27+
28+
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
29+
dataset = client.datasets(name="my_dataset")
30+
dataset.to_hub(repo_id="<repo_id>")
31+
```
32+
33+
!!! note "With or without records"
34+
The example above will push the dataset's `Settings` and records to the hub. If you only want to push the dataset's configuration, you can set the `with_records` parameter to `False`. This is useful if you're just interested in a specific dataset template or you want to make changes in the dataset settings and/or records.
35+
36+
```python
37+
dataset.to_hub(repo_id="<repo_id>", with_records=False)
38+
```
39+
40+
41+
### Pull an Argilla dataset from the Hugging Face Hub
42+
43+
You can pull a dataset from the Hugging Face Hub to Argilla using the `rg.Dataset.from_hub` method. This is useful if you want to restore a dataset and its configuration.
44+
45+
```python
46+
47+
import argilla as rg
48+
49+
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
50+
dataset = rg.Dataset.from_hub(repo_id="<repo_id>")
51+
```
52+
53+
The `rg.Dataset.from_hub` method loads the configuration and records from the dataset repo. If you only want to load records, you can pass a `datasets.Dataset` object to the `rg.Dataset.records.log` method. This enables you to configure your own dataset and reuse existing Hub datasets. See the [guide on records](record.md) for more information.
54+
55+
!!! note "With or without records"
56+
57+
The example above will pull the dataset's `Settings` and records from the hub. If you only want to pull the dataset's configuration, you can set the `with_records` parameter to `False`. This is useful if you're just interested in a specific dataset template or you want to make changes in the dataset settings and/or records.
58+
59+
```python
60+
dataset = rg.Dataset.from_hub(repo_id="<repo_id>", with_records=False)
61+
```
62+
63+
With the dataset's configuration you could then make changes to the dataset. For example, you could adapt the dataset's settings for a different task:
64+
65+
```python
66+
dataset.settings.questions = [rg.TextQuestion(name="answer")]
67+
```
68+
69+
You could then load the records with the `load_dataset` function of the `datasets` package and pass the result to the `rg.Dataset.records.log` method.
70+
71+
```python
72+
from datasets import load_dataset

hf_dataset = load_dataset("<repo_id>")
73+
dataset.records.log(hf_dataset)
74+
```
75+
76+
77+
78+
### Saving an Argilla dataset to local disk
79+
80+
You can save a dataset from Argilla to your local disk. This is useful if you want to back up your dataset. You can use the `rg.Dataset.to_disk` method.
81+
82+
```python
83+
import argilla as rg
84+
85+
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
86+
workspace = client.workspaces("my_workspace")
dataset = client.datasets(name="my_dataset", workspace=workspace)
87+
88+
dataset.to_disk(path="path/to/dataset")
89+
```
90+
91+
This will save the dataset's configuration and records to the specified path. If you only want to save the dataset's configuration, you can set the `with_records` parameter to `False`.
92+
93+
```python
94+
dataset.to_disk(path="path/to/dataset", with_records=False)
95+
```
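Before relying on a backup, it can help to check what was actually written. The following is a minimal sketch using only the standard library: `summarize_export` is a hypothetical helper, not part of Argilla, and the temporary directory with a placeholder file stands in for a real `dataset.to_disk(path=...)` output, whose file layout is not assumed here.

```python
import tempfile
from pathlib import Path


def summarize_export(path):
    """Return {relative_file_name: size_in_bytes} for every file under `path`."""
    root = Path(path)
    if not root.is_dir():
        raise FileNotFoundError(f"export directory not found: {root}")
    return {
        str(file.relative_to(root)): file.stat().st_size
        for file in sorted(root.rglob("*"))
        if file.is_file()
    }


# Stand-in for a directory produced by `dataset.to_disk(path=...)`.
# The file name below is a placeholder, not Argilla's actual layout.
export_dir = Path(tempfile.mkdtemp())
(export_dir / "placeholder.json").write_text("{}", encoding="utf-8")

summary = summarize_export(export_dir)
```

In practice you would call `summarize_export("path/to/dataset")` on the same path that was passed to `to_disk`.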
96+
97+
### Loading an Argilla dataset from local disk
98+
99+
You can load a dataset from your local disk to Argilla. This is useful if you want to restore a dataset's configuration. You can use the `rg.Dataset.from_disk` method.
100+
101+
```python
102+
import argilla as rg
103+
104+
dataset = rg.Dataset.from_disk(path="path/to/dataset")
105+
```
106+
107+
!!! note "Directing the dataset to a workspace and name"
108+
You can also specify the workspace and name of the dataset when loading it from the disk.
109+
110+
```python
111+
dataset = rg.Dataset.from_disk(path="path/to/dataset", target_workspace=workspace, target_name="my_dataset")
112+
```
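If the same export is restored more than once, a timestamp suffix is one way to keep the target names distinct, in the spirit of the timestamped dataset name used in the commit description. This is a sketch only: `timestamped_name` is a hypothetical helper and the underscore separator is an arbitrary choice.

```python
from datetime import datetime


def timestamped_name(base, now=None):
    """Append a YYYYmmddHHMMSS timestamp to `base`."""
    now = now or datetime.now()
    return f"{base}_{now.strftime('%Y%m%d%H%M%S')}"


# Fixed timestamp so the result is reproducible in this example.
example_name = timestamped_name("my_dataset", datetime(2024, 1, 2, 3, 4, 5))
```

The result could then be passed as `target_name` to `rg.Dataset.from_disk`.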
113+
114+
## Export only records from Argilla Datasets
115+
116+
The records alone can be exported from a dataset in Argilla. This is useful if you want to process the records in Python, export them to a different platform, or use them in model training. All of these methods use the `rg.Dataset.records` attribute.
117+
118+
The records can be exported as a dictionary, a list of dictionaries, or to a `Dataset` of the `datasets` package.
119+
120+
121+
=== "To the `datasets` package"
122+
123+
124+
Records can be exported from `Dataset.records` to the `datasets` package with the `to_datasets` method. You can specify the name of the dataset and the split to export the records.
125+
126+
```python
127+
import argilla as rg
128+
129+
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
130+
dataset = client.datasets(name="my_dataset")
131+
132+
# Export records as a `datasets.Dataset`
133+
exported_ds = dataset.records.to_datasets()
134+
```
135+
136+
=== "To a Python dictionary"
137+
138+
Records can be exported from `Dataset.records` as a dictionary with the `to_dict` method. You can specify the orientation of the dictionary output. You can also choose whether to flatten the dictionary.
139+
140+
```python
141+
import argilla as rg
142+
143+
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
144+
dataset = client.datasets(name="my_dataset")
145+
146+
# Export records as a dictionary
147+
exported_records = dataset.records.to_dict()
148+
# {'fields': [{'text': 'Hello'}, {'text': 'World'}], 'suggestions': [{'label': {'value': 'positive'}}, {'label': {'value': 'negative'}}]}
149+
150+
# Export records as a dictionary with orient=index
151+
exported_records = dataset.records.to_dict(orient="index")
152+
# {"uuid": {'fields': {'text': 'Hello'}, 'suggestions': {'label': {'value': 'positive'}}}, "uuid": {'fields': {'text': 'World'}, 'suggestions': {'label': {'value': 'negative'}}}}
153+
154+
# Export records as a dictionary with flatten=True
155+
exported_records = dataset.records.to_dict(flatten=True)
156+
# {"text": ["Hello", "World"], "label.suggestion": ["greeting", "greeting"]}
157+
```
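The flattened dictionary is column-oriented: each key maps to a list with one value per record. If you need one dictionary per record instead, the columns can be pivoted in plain Python. The sketch below assumes the flattened shape shown in the last comment above and uses hard-coded sample data in place of `dataset.records.to_dict(flatten=True)`; `columns_to_rows` is a hypothetical helper, not part of Argilla.

```python
def columns_to_rows(columns):
    """Turn {"col": [v1, v2]} into [{"col": v1}, {"col": v2}]."""
    names = list(columns)
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    # zip(*...) walks the columns in parallel, yielding one tuple per record.
    return [dict(zip(names, values)) for values in zip(*columns.values())]


# Sample data standing in for `dataset.records.to_dict(flatten=True)`.
flat_columns = {
    "text": ["Hello", "World"],
    "label.suggestion": ["greeting", "greeting"],
}

rows = columns_to_rows(flat_columns)
```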
158+
159+
=== "To a Python list"
160+
161+
Records can be exported from `Dataset.records` as a list of dictionaries with the `to_list` method. You can choose whether to flatten it.
162+
163+
```python
164+
import argilla as rg
165+
166+
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
167+
168+
workspace = client.workspaces("my_workspace")
169+
170+
dataset = client.datasets(name="my_dataset", workspace=workspace)
171+
172+
# Export records as a list of dictionaries
173+
exported_records = dataset.records.to_list()
174+
# [{'fields': {'text': 'Hello'}, 'suggestion': {'label': {'value': 'greeting'}}}, {'fields': {'text': 'World'}, 'suggestion': {'label': {'value': 'greeting'}}}]
175+
176+
# Export records as a list of dictionaries with flatten=True
177+
exported_records = dataset.records.to_list(flatten=True)
178+
# [{"text": "Hello", "label": "greeting"}, {"text": "World", "label": "greeting"}]
179+
```
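Once records are available as a list of dictionaries, they can be written to disk with the standard library, for example as JSON Lines. The sketch below assumes the flattened shape shown in the comment above and uses hard-coded sample rows in place of `dataset.records.to_list(flatten=True)`; `write_jsonl` and the output file name are illustrative and not part of Argilla.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(rows, path):
    """Write one JSON object per line and return the number of rows written."""
    path = Path(path)
    with path.open("w", encoding="utf-8") as handle:
        for row in rows:
            # default=str covers values such as UUIDs that json cannot encode natively.
            handle.write(json.dumps(row, default=str, ensure_ascii=False) + "\n")
    return len(rows)


# Sample rows standing in for `dataset.records.to_list(flatten=True)`.
exported_records = [
    {"text": "Hello", "label": "greeting"},
    {"text": "World", "label": "greeting"},
]

output_path = Path(tempfile.mkdtemp()) / "records.jsonl"
written = write_jsonl(exported_records, output_path)

# Read the file back to confirm the round trip.
with output_path.open(encoding="utf-8") as handle:
    reloaded = [json.loads(line) for line in handle]
```

JSON Lines keeps each record on its own line, so the file can be streamed or appended to without parsing it as a whole.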

argilla/docs/how_to_guides/index.md

Lines changed: 11 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -43,13 +43,21 @@ These guides provide step-by-step instructions for common scenarios, including d
4343

4444
[:octicons-arrow-right-24: How-to guide](record.md)
4545

46-
- __Query, filter and export records__
46+
- __Query and filter a dataset__
4747

4848
---
4949

50-
Learn how to query and filter a `Dataset` and export their `Records`.
50+
Learn how to query and filter a `Dataset`.
5151

52-
[:octicons-arrow-right-24: How-to guide](query_export.md)
52+
[:octicons-arrow-right-24: How-to guide](query.md)
53+
54+
- __Importing and exporting datasets and records__
55+
56+
---
57+
58+
Learn how to import and export your dataset or its records to Python, your local disk, or the Hugging Face Hub.
59+
60+
[:octicons-arrow-right-24: How-to guide](import_export.md)
5361

5462
</div>
5563

argilla/docs/how_to_guides/query_export.md renamed to argilla/docs/how_to_guides/query.md

Lines changed: 2 additions & 52 deletions
Original file line numberDiff line numberDiff line change
@@ -2,9 +2,9 @@
22
description: In this section, we will provide a step-by-step guide to show how to filter and query a dataset.
33
---
44

5-
# Query, filter, and export records
5+
# Query and filter records
66

7-
This guide provides an overview of how to query and filter a dataset in Argilla and export records.
7+
This guide provides an overview of how to query and filter a dataset in Argilla.
88

99
You can search for records in your dataset by **querying** or **filtering**. The query focuses on the content of the text field, while the filter is used to filter the records based on conditions. You can use them independently or combine multiple filters to create complex search queries. You can also export records from a dataset either as a single dictionary or a list of dictionaries.
1010

@@ -170,53 +170,3 @@ queried_filtered_records = list(dataset.records(
170170
)
171171
)
172172
```
173-
174-
## Export records to a dictionary
175-
176-
Records can be exported from `Dataset.records` as a dictionary. The `to_dict` method can be used to export records as a dictionary. You can specify the orientation of the dictionary output. You can also decide if to flatten or not the dictionary.
177-
178-
=== "
179-
180-
```python
181-
import argilla as rg
182-
183-
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
184-
185-
workspace = client.workspaces("my_workspace")
186-
187-
dataset = client.datasets(name="my_dataset", workspace=workspace)
188-
189-
# Export records as a dictionary
190-
exported_records = dataset.records.to_dict()
191-
# {'fields': [{'text': 'Hello'},{'text': 'World'}], suggestions': [{'label': {'value': 'positive'}}, {'label': {'value': 'negative'}}]
192-
193-
# Export records as a dictionary with orient=index
194-
exported_records = dataset.records.to_dict(orient="index")
195-
# {"uuid": {'fields': {'text': 'Hello'}, 'suggestions': {'label': {'value': 'positive'}}}, {"uuid": {'fields': {'text': 'World'}, 'suggestions': {'label': {'value': 'negative'}}},
196-
197-
# Export records as a dictionary with flatten=false
198-
exported_records = dataset.records.to_dict(flatten=True)
199-
# {"text": ["Hello", "World"], "label.suggestion": ["greeting", "greeting"]}
200-
```
201-
202-
## Export records to a list
203-
204-
Records can be exported from `Dataset.records` as a list of dictionaries. The `to_list` method can be used to export records as a list of dictionaries. You can decide if to flatten it or not.
205-
206-
```python
207-
import argilla as rg
208-
209-
client = rg.Argilla(api_url="<api_url>", api_key="<api_key>")
210-
211-
workspace = client.workspaces("my_workspace")
212-
213-
dataset = client.datasets(name="my_dataset", workspace=workspace)
214-
215-
# Export records as a list of dictionaries
216-
exported_records = dataset.records.to_list()
217-
# [{'fields': {'text': 'Hello'}, 'suggestion': {'label': {value: 'greeting'}}}, {'fields': {'text': 'World'}, 'suggestion': {'label': {value: 'greeting'}}}]
218-
219-
# Export records as a list of dictionaries with flatten=False
220-
exported_records = dataset.records.to_list(flatten=True)
221-
# [{"text": "Hello", "label": "greeting"}, {"text": "World", "label": "greeting"}]
222-
```

argilla/docs/how_to_guides/record.md

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -485,7 +485,7 @@ dataset.records.delete(records=records_to_delete)
485485
!!! tip "Delete records based on a query"
486486
It can be very useful to avoid eliminating records with responses.
487487

488-
> For more information about the query syntax, check this [how-to guide](query_export.md).
488+
> For more information about the query syntax, check this [how-to guide](query.md).
489489

490490
```python
491491
status_filter = rg.Query(

argilla/docs/reference/argilla/client.md

Lines changed: 2 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -44,10 +44,9 @@ for dataset in my_workspace.datasets:
4444

4545
---
4646

47-
## Class Reference
48-
49-
### `rg.Argilla`
47+
## `rg.Argilla`
5048

5149
::: src.argilla.client.Argilla
5250
options:
5351
heading_level: 3
52+
show_root_toc_entry: false

argilla/docs/reference/argilla/datasets/dataset_records.md

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -215,10 +215,9 @@ Check out the [`rg.Record`](../records/records.md) class reference for more info
215215

216216
---
217217

218-
## Class Reference
219-
220-
### `rg.Dataset.records`
218+
## `rg.Dataset.records`
221219

222220
::: src.argilla.records._dataset_records.DatasetRecords
223221
options:
224-
heading_level: 3
222+
heading_level: 3
223+
show_root_toc_entry: false

argilla/docs/reference/argilla/datasets/datasets.md

Lines changed: 15 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -39,10 +39,21 @@ dataset = client.datasets("my_dataset")
3939

4040
---
4141

42-
## Class Reference
43-
44-
### `rg.Dataset`
42+
## `rg.Dataset`
4543

4644
::: src.argilla.datasets._resource.Dataset
4745
options:
48-
heading_level: 3
46+
heading_level: 3
47+
show_root_toc_entry: false
48+
49+
::: src.argilla.datasets._export._disk.DiskImportExportMixin
50+
options:
51+
heading_level: 3
52+
show_root_heading: false
53+
show_root_toc_entry: false
54+
55+
::: src.argilla.datasets._export._hub.HubImportExportMixin
56+
options:
57+
heading_level: 3
58+
show_root_heading: false
59+
show_root_toc_entry: false

argilla/docs/reference/argilla/records/records.md

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -52,10 +52,9 @@ For changes to take effect, the user must call the `update` method on the `Datas
5252

5353
---
5454

55-
## Class Reference
56-
57-
### `rg.Record`
55+
## `rg.Record`
5856

5957
::: src.argilla.records._resource.Record
6058
options:
61-
heading_level: 3
59+
heading_level: 3
60+
show_root_toc_entry: false

argilla/docs/reference/argilla/records/responses.md

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -67,10 +67,9 @@ for record in dataset.records:
6767

6868
---
6969

70-
## Class Reference
71-
72-
### `rg.Response`
70+
## `rg.Response`
7371

7472
::: src.argilla.responses.Response
7573
options:
76-
heading_level: 3
74+
heading_level: 3
75+
show_root_toc_entry: false

argilla/docs/reference/argilla/records/suggestions.md

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -82,10 +82,9 @@ for record in dataset.records(with_suggestions=True):
8282

8383
---
8484

85-
## Class Reference
86-
87-
### `rg.Suggestion`
85+
## `rg.Suggestion`
8886

8987
::: src.argilla.suggestions.Suggestion
9088
options:
91-
heading_level: 3
89+
heading_level: 3
90+
show_root_toc_entry: false
