Commit 39779c1

Authored by burtenshaw, frascuchon, and davidberenstein1957

[DOCS] add migration notebook to docs format (#5002)

This PR transfers the notebook guide by @frascuchon and @nataliaElv over to the documentation format, which makes it easier to follow. You can find the page here: http://localhost:8000/argilla-python/guides/how_to_guides/migrate_from_legacy_datasets/ N.B. the page is not available via the menu. @davidberenstein1957 we should figure out navigation presentation for this page. The guide looks like this:

![image](https://github.com/argilla-io/argilla/assets/19620375/e4e00da8-6117-4f45-9663-1a8c8be9e987)

---------

Co-authored-by: Francisco Aranda <[email protected]>
Co-authored-by: David Berenstein <[email protected]>
Co-authored-by: Ben Burtenshaw <[email protected]>

1 parent 0348f4e commit 39779c1

1 file changed: 259 additions, 0 deletions

# Migrate your legacy datasets to Argilla V2

This guide will help you migrate task-specific datasets to Argilla V2. These do not include the `FeedbackDataset`, which is just an interim naming convention for the latest extensible dataset. Task-specific datasets are datasets that are used for a specific task, such as text classification, token classification, etc. If you would like to learn about the backstory of this SDK migration, please refer to the [SDK migration blog post](https://argilla.io/blog/introducing-argilla-new-sdk/).

!!! note
    Legacy datasets include: `DatasetForTextClassification`, `DatasetForTokenClassification`, and `DatasetForText2Text`.

    `FeedbackDataset` datasets do not need to be migrated as they are already in the Argilla V2 format.

To follow this guide, you will need the following prerequisites:

- An Argilla 1.* server instance running with legacy datasets.
- An Argilla >=1.29 server instance running. If you don't have one, you can create one by following the [Argilla installation guide](../../getting_started/installation.md).
- The `argilla` SDK package installed in your environment.

If your current legacy datasets are on a server running an Argilla release later than 1.29, you could choose to recreate your legacy datasets as new datasets on the same server. You could then upgrade the server to Argilla 2.0 and carry on working there. Your legacy datasets will not be visible on the new server, but they will remain in the storage layers if you need to access them.

## Steps

The guide will take you through three steps:

1. **Retrieve the legacy dataset** from the Argilla V1 server using the new `argilla` package.
2. **Define the new dataset** in the Argilla V2 format.
3. **Upload the dataset records** to the new Argilla V2 dataset format and attributes.

### Step 1: Retrieve the legacy dataset

Connect to the Argilla V1 server via the new `argilla` package. The new SDK contains a `v1` module that allows you to connect to the Argilla V1 server:

```python
import argilla.v1 as rg_v1

# Initialize the API with an Argilla server earlier than 2.0
api_url = "<your-url>"
api_key = "<your-api-key>"
rg_v1.init(api_url, api_key)
```
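
If you are not sure which datasets and workspaces exist on the legacy server, you can first inspect what the `v1` compatibility module exposes. This is a minimal discovery sketch; the `list_datasets` helper is an assumption about the module's surface, so the call is guarded:

```python
# Print the public helpers exposed by the compatibility module.
print([name for name in dir(rg_v1) if not name.startswith("_")])

# Hypothetical helper -- only call it if your installed version provides it.
if hasattr(rg_v1, "list_datasets"):
    print(rg_v1.list_datasets())
```
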
Next, load the dataset settings and records from the Argilla V1 server:

```python
dataset_name = "news-programmatic-labeling"
workspace = "demo"

settings_v1 = rg_v1.load_dataset_settings(dataset_name, workspace)
records_v1 = rg_v1.load(dataset_name, workspace)
hf_dataset = records_v1.to_datasets()
```

Your legacy dataset is now loaded into the `hf_dataset` object.
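
Before moving on, it can help to peek at what was loaded. A minimal sketch, assuming `hf_dataset` behaves like a Hugging Face `datasets.Dataset` and that `settings_v1` exposes the `label_schema` used later in this guide:

```python
# Inspect the legacy label schema and the first exported record. The record keys
# ("inputs", "prediction", "annotation", "metadata", "vectors", ...) are the ones
# the mapping functions in Step 3 rely on.
print(settings_v1.label_schema)
print(hf_dataset.column_names)
print(hf_dataset[0])
```
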
### Step 2: Define the new dataset

Define the new dataset in the Argilla V2 format using the `Settings` and `Dataset` classes from the `argilla` package.

First, instantiate the `Argilla` class to connect to the Argilla V2 server:

```python
import argilla as rg

client = rg.Argilla()
```

Next, define the new dataset settings:

```python
settings = rg.Settings(
    fields=[
        rg.TextField(name="text"),  # (1)
    ],
    questions=[
        rg.LabelQuestion(name="label", labels=settings_v1.label_schema),  # (2)
    ],
    metadata=[
        rg.TermsMetadataProperty(name="split"),  # (3)
    ],
    vectors=[
        rg.VectorField(name="mini-lm-sentence-transformers", dimensions=384),  # (4)
    ],
)
```

1. The default field name for text classification is `text`, but you should provide a field for every name included in `record.inputs` (see the sketch below).
2. The main question for text classification is a `LabelQuestion` for single-label classification or a `MultiLabelQuestion` for multi-label classification.
3. Here, you need to provide all relevant metadata fields.
4. The vector fields available in the dataset.
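
If your legacy records carry more than one input, you might derive the field list from the data itself instead of hard-coding `text`. A minimal sketch, assuming each exported record stores its inputs as a dictionary under the `inputs` key (as the text classification mapping functions in Step 3 expect):

```python
# Build one TextField per key found in the first legacy record's inputs dict.
sample = hf_dataset[0]
fields = [rg.TextField(name=name) for name in sample["inputs"]]
```
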
Finally, create the new dataset on the Argilla V2 server:

```python
dataset = rg.Dataset(name=dataset_name, settings=settings)
dataset.create()
```

!!! note
    If a dataset with the same name already exists, the `create` method will raise an exception. You can check if the dataset exists and delete it before creating a new one:

    ```python
    dataset = client.datasets(name=dataset_name)

    if dataset.exists():
        dataset.delete()
    ```

### Step 3: Upload the dataset records

To upload the records to the new server, we need to convert them from the Argilla V1 format to the Argilla V2 format. The new `argilla` SDK package uses a generic `Record` class, but legacy datasets have task-specific record classes, so each record must be mapped to the generic `Record` class.
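
For orientation, each record dictionary exported in Step 1 looks roughly like the example below. The values are made up for illustration; the keys are the ones the mapping functions read:

```python
# Illustrative shape of one exported text classification record (values are hypothetical).
legacy_record = {
    "id": "00000000-0000-0000-0000-000000000000",
    "inputs": {"text": "The market rallied on Friday."},
    "prediction": [{"label": "business", "score": 0.87}],
    "prediction_agent": "my-zero-shot-model",
    "annotation": "business",
    "annotation_agent": "argilla",
    "metadata": {"split": "train"},
    "vectors": {"mini-lm-sentence-transformers": [0.1, 0.2, 0.3]},
}
```
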

Here is a set of example functions to convert the records for single-label classification, multi-label classification, token classification, and text generation. You can modify these functions to suit your dataset.
=== "For single-label classification"
116+
117+
```python
118+
def map_to_record_for_single_label(data: dict, users_by_name: dict, current_user: rg.User) -> rg.Record:
119+
""" This function maps a text classification record dictionary to the new Argilla record."""
120+
suggestions = []
121+
responses = []
122+
123+
if prediction := data.get("prediction"):
124+
label, score = prediction[0].values()
125+
agent = data["prediction_agent"]
126+
suggestions.append(rg.Suggestion(question_name="label", value=label, score=score, agent=agent))
127+
128+
if annotation := data.get("annotation"):
129+
user_id = users_by_name.get(data["annotation_agent"], current_user).id
130+
responses.append(rg.Response(question_name="label", value=annotation, user_id=user_id))
131+
132+
vectors = (data.get("vectors") or {})
133+
return rg.Record(
134+
id=data["id"],
135+
fields=data["inputs"],
136+
# The inputs field should be a dictionary with the same keys as the `fields` in the settings
137+
metadata=data["metadata"],
138+
# The metadata field should be a dictionary with the same keys as the `metadata` in the settings
139+
vectors=[rg.Vector(name=name, values=value) for name, value in vectors.items()],
140+
suggestions=suggestions,
141+
responses=responses,
142+
)
143+
```
144+
145+
=== "For multi-label classification"
146+
147+
```python
148+
def map_to_record_for_multi_label(data: dict, users_by_name: dict, current_user: rg.User) -> rg.Record:
149+
""" This function maps a text classification record dictionary to the new Argilla record."""
150+
suggestions = []
151+
responses = []
152+
153+
if prediction := data.get("prediction"):
154+
labels, scores = zip(*[(pred["label"], pred["score"]) for pred in prediction])
155+
agent = data["prediction_agent"]
156+
suggestions.append(rg.Suggestion(question_name="labels", value=labels, score=scores, agent=agent))
157+
158+
if annotation := data.get("annotation"):
159+
user_id = users_by_name.get(data["annotation_agent"], current_user).id
160+
responses.append(rg.Response(question_name="label", value=annotation, user_id=user_id))
161+
162+
vectors = data.get("vectors") or {}
163+
return rg.Record(
164+
id=data["id"],
165+
fields=data["inputs"],
166+
# The inputs field should be a dictionary with the same keys as the `fields` in the settings
167+
metadata=data["metadata"],
168+
# The metadata field should be a dictionary with the same keys as the `metadata` in the settings
169+
vectors=[rg.Vector(name=name, values=value) for name, value in vectors.items()],
170+
suggestions=suggestions,
171+
responses=responses,
172+
)
173+
```
174+
175+
=== "For token classification"
176+
177+
```python
178+
def map_to_record_for_span(data: dict, users_by_name: dict, current_user: rg.User) -> rg.Record:
179+
""" This function maps a token classification record dictionary to the new Argilla record."""
180+
suggestions = []
181+
responses = []
182+
183+
if prediction := data.get("prediction"):
184+
scores = [span["score"] for span in prediction]
185+
agent = data["prediction_agent"]
186+
suggestions.append(rg.Suggestion(question_name="spans", value=prediction, score=scores, agent=agent))
187+
188+
if annotation := data.get("annotation"):
189+
user_id = users_by_name.get(data["annotation_agent"], current_user).id
190+
responses.append(rg.Response(question_name="spans", value=annotation, user_id=user_id))
191+
192+
vectors = data.get("vectors") or {}
193+
return rg.Record(
194+
id=data["id"],
195+
fields={"text": data["text"]},
196+
# The inputs field should be a dictionary with the same keys as the `fields` in the settings
197+
metadata=data["metadata"],
198+
# The metadata field should be a dictionary with the same keys as the `metadata` in the settings
199+
vectors=[rg.Vector(name=name, values=value) for name, value in vectors.items()],
200+
# The vectors field should be a dictionary with the same keys as the `vectors` in the settings
201+
suggestions=suggestions,
202+
responses=responses,
203+
)
204+
```
205+
206+
=== "For Text generation"
207+
208+
```python
209+
def map_to_record_for_text_generation(data: dict, users_by_name: dict, current_user: rg.User) -> rg.Record:
210+
""" This function maps a text2text record dictionary to the new Argilla record."""
211+
suggestions = []
212+
responses = []
213+
214+
if prediction := data.get("prediction"):
215+
first = prediction[0]
216+
agent = data["prediction_agent"]
217+
suggestions.append(
218+
rg.Suggestion(question_name="text_generation", value=first["text"], score=first["score"], agent=agent)
219+
)
220+
221+
if annotation := data.get("annotation"):
222+
# From data[annotation]
223+
user_id = users_by_name.get(data["annotation_agent"], current_user).id
224+
responses.append(rg.Response(question_name="text_generation", value=annotation, user_id=user_id))
225+
226+
vectors = (data.get("vectors") or {})
227+
return rg.Record(
228+
id=data["id"],
229+
fields={"text": data["text"]},
230+
# The inputs field should be a dictionary with the same keys as the `fields` in the settings
231+
metadata=data["metadata"],
232+
# The metadata field should be a dictionary with the same keys as the `metadata` in the settings
233+
vectors=[rg.Vector(name=name, values=value) for name, value in vectors.items()],
234+
# The vectors field should be a dictionary with the same keys as the `vectors` in the settings
235+
suggestions=suggestions,
236+
responses=responses,
237+
)
238+
```
239+
240+
The functions above depend on the `users_by_name` dictionary and the `current_user` object to assign responses to users, so we first need to load the existing users. You can retrieve the users and the current user from the Argilla V2 server as follows:

```python
# Map usernames to users on the Argilla V2 server, and fetch the current user
users_by_name = {user.username: user for user in client.users}
current_user = client.me
```

Finally, upload the records to the new dataset using the `log` method and the mapping function that matches your legacy task (single-label classification is shown here):

```python
records = []

for data in hf_dataset:
    records.append(map_to_record_for_single_label(data, users_by_name, current_user))

# Upload the records to the new dataset
dataset.records.log(records)
```
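
As a quick sanity check, you can read a few records back from the new dataset. This is a minimal sketch and assumes `dataset.records` can be iterated directly; adjust it to whatever record-export method your SDK version provides:

```python
from itertools import islice

# Assumption: `dataset.records` is directly iterable in the new SDK.
for record in islice(dataset.records, 3):
    print(record.id, record.fields)
```
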
You have now successfully migrated your legacy dataset to Argilla V2. For more guides on how to use the Argilla SDK, please refer to the [How to guides](index.md).