
Commit bf2ec3e

Add data tutorial (#53)
1 parent 55b3dd6 commit bf2ec3e

File tree

3 files changed

+246
-1
lines changed


docs/source/conf.py

Lines changed: 1 addition & 0 deletions
@@ -94,6 +94,7 @@ def setup(app) -> None:  # noqa: ANN001
         ("tutorials.pipeline_optimization", "Pipeline Optimization"),
         ("tutorials.modules.scoring", "Scoring Modules", [("linear", "Linear Scorer")]),
         ("tutorials.modules.prediction", "Prediction Modules", [("argmax", "Argmax Predictor")]),
+        ("tutorials.data", "Data"),
     ],
 )
 app.connect("autodoc-skip-member", skip_member)

pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -97,7 +97,7 @@ ignore = [
 "autointent/modules/*" = ["ARG002", "ARG003"]  # unused argument
 "docs/*" = ["INP001", "A001", "D"]
 "*/utils.py" = ["D104", "D100"]
-"tutorials/*" = ["INP001", "T", "D"]
+"tutorials/*" = ["B018", "E501", "INP001", "T", "D"]

 [tool.ruff.lint.pylint]
 max-args = 10

tutorials/data.py

Lines changed: 244 additions & 0 deletions
@@ -0,0 +1,244 @@
# %% [markdown]
"""
# Working with data
"""

# %%
import importlib.resources as ires

import datasets

from autointent.context.data_handler import Dataset

# %%
datasets.logging.disable_progress_bar()  # disable tqdm outputs

# %% [markdown]
"""
## Creating a dataset

To create a dataset, provide a training split containing samples with utterances and labels, as shown below:

```json
{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        ...
    ]
}
```

For a multilabel dataset, the `label` field should be a list of integers representing the corresponding class labels.
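For instance, a multilabel training split might look like this (the utterances and class indices are illustrative, not part of any shipped dataset):

```json
{
    "train": [
        {
            "utterance": "Hello! My card won't activate.",
            "label": [0, 2]
        },
        {
            "utterance": "Goodbye!",
            "label": [1]
        }
    ]
}
```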

### Handling out-of-scope (OOS) samples

To mark a sample as out-of-scope (OOS), omit the `label` field from its dictionary. For example:

```json
{
    "train": [
        {
            "utterance": "OOS request"
        },
        ...
    ]
}
```

### Validation and test splits

By default, a portion of the training split is allocated for validation and testing.
However, you can also specify a test split explicitly:

```json
{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        ...
    ],
    "test": [
        {
            "utterance": "Hi!",
            "label": 0
        },
        ...
    ]
}
```

### Adding metadata to intents

You can add metadata to the intents in your dataset, such as
regular expressions, intent names, descriptions, or tags, using the `intents` field:

```json
{
    "train": [
        {
            "utterance": "Hello!",
            "label": 0
        },
        ...
    ],
    "intents": [
        {
            "id": 0,
            "name": "greeting",
            "tags": ["conversation_start"],
            "regexp_partial_match": ["\\bhello\\b"],
            "regexp_full_match": ["^hello$"],
            "description": "User wants to initiate a conversation with a greeting."
        },
        ...
    ]
}
```

- `name`: A human-readable representation of the intent.
- `tags`: Used in multilabel scenarios to predict the most probable class listed under a specific tag.
- `regexp_partial_match` and `regexp_full_match`: Used by the `RegExp` module to predict intents based on the provided patterns.
- `description`: Used by the `DescriptionScorer` to calculate scores based on the similarity between an utterance and the intent descriptions.

All fields in the `intents` list are optional except for `id`.
"""

# %% [markdown]
"""
## Loading a dataset

There are three main ways to load your dataset:

1. From a Python dictionary.
2. From a JSON file.
3. Directly from the Hugging Face Hub.
"""

# %% [markdown]
"""
### Creating a dataset from a Python dictionary
"""

# %%
dataset = Dataset.from_dict(
    {
        "train": [
            {
                "utterance": "Please help me with my card. It won't activate.",
                "label": 0,
            },
            {
                "utterance": "I tried but am unable to activate my card.",
                "label": 0,
            },
            {
                "utterance": "I want to open an account for my children.",
                "label": 1,
            },
            {
                "utterance": "How old do you need to be to use the bank's services?",
                "label": 1,
            },
        ],
        "test": [
            {
                "utterance": "I want to start using my card.",
                "label": 0,
            },
            {
                "utterance": "How old do I need to be?",
                "label": 1,
            },
        ],
        "intents": [
            {
                "id": 0,
                "name": "activate_my_card",
            },
            {
                "id": 1,
                "name": "age_limit",
            },
        ],
    },
)
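When your raw data lives in flat lists, the same dictionary can be assembled programmatically instead of written by hand. A minimal sketch (the pairs and intent names below reuse the example above; the helper structure itself is illustrative):

```python
# Flat (utterance, label) pairs and an id -> name mapping.
train_pairs = [
    ("Please help me with my card. It won't activate.", 0),
    ("I want to open an account for my children.", 1),
]
intent_names = {0: "activate_my_card", 1: "age_limit"}

# Build the dictionary expected by the dataset format shown above.
payload = {
    "train": [{"utterance": u, "label": label} for u, label in train_pairs],
    "intents": [{"id": i, "name": name} for i, name in sorted(intent_names.items())],
}

print(payload["train"][0]["label"])   # 0
print(payload["intents"][1]["name"])  # age_limit
```

The resulting `payload` can then be passed to `Dataset.from_dict` exactly as in the cell above.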

# %% [markdown]
"""
### Loading a dataset from a file

The AutoIntent library includes sample datasets.
For example, you can load the `autointent/datafiles/dstc3-20shot.json` file like this:
"""

# %%
dataset = Dataset.from_json(
    ires.files("autointent.datafiles").joinpath("dstc3-20shot.json"),
)

# %% [markdown]
"""
### Loading a dataset from the Hugging Face Hub

If your dataset on the Hugging Face Hub matches the required format, you can load it directly using its repository ID:
"""

# %%
# dataset = Dataset.from_datasets("<repo_id>")

# %% [markdown]
"""
### Accessing dataset splits
"""

# %%
dataset = Dataset.from_json(
    ires.files("autointent.datafiles").joinpath("banking77.json"),
)

# %% [markdown]
"""
The `Dataset` class organizes your data as a dictionary of splits (`str: datasets.Dataset`).
For example, after initialization an `oos` key may be added if OOS samples are provided.
In the case of the `banking77` dataset, only the `train` split is available, which you can access as shown below:
"""

# %%
dataset["train"]

# %% [markdown]
"""
### Working with dataset splits

Each split in the `Dataset` class is an instance of [datasets.Dataset](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.Dataset),
so you can work with them accordingly.
"""

# %%
dataset["train"][:5]  # get the first 5 train samples

# %% [markdown]
"""
### Working with intents

The metadata you added to intents in your dataset is stored in the `intents: list[Intent]` attribute.
"""

# %%
dataset.intents[0]  # get the intent with id=0

# %% [markdown]
"""
### Pushing a dataset to the Hugging Face Hub

To share your dataset on the Hugging Face Hub, use the `push_to_hub` method.
Ensure that you are logged in via the `huggingface-cli` tool first:
"""

# %%
# dataset.push_to_hub("<repo_id>")
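Logging in is a one-time step from the shell; the CLI will prompt for a Hugging Face access token:

```
huggingface-cli login
```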
