
Commit b57ccad

Ankur-singh, jimmytwei, and xmnboy authored

fixed HF load_dataset (#2392)

* Update README.md
* Update README.md
* Update README.md
* Update lu_solve_omp_offload_optimized.F90
* Update openmp_sample.f90 (#2371)
* fixed HF load_dataset
* Update README.md to reflect changes in compiler output. (#2402)

  This update contains a rewritten README.md file for the `guided_matrix_mult_InvalidContexts` sample. It also adds two screenshots (png files) that accompany the rewritten README.md file. The previous version of this README.md did not include screenshots.

---------

Co-authored-by: Jimmy Wei <[email protected]>
Co-authored-by: Paul Fischer <[email protected]>
1 parent 521f905 commit b57ccad

File tree

5 files changed: +338, -162 lines changed


AI-and-Analytics/Features-and-Functionality/INC_QuantizationAwareTraining_TextClassification/INC_QuantizationAwareTraining_TextClassification.ipynb

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@
 "source": [
 "from datasets import load_dataset\n",
 "\n",
-"dataset = load_dataset(\"emotion\", name=\"split\")\n",
+"dataset = load_dataset(\"emotion\", name=\"split\", trust_remote_code=True)\n",
 "dataset['train'][:10]"
 ]
 },
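The substantive change in this notebook (and in the script below) is the added `trust_remote_code=True` argument. As a minimal illustrative sketch, not part of the commit: recent releases of the `datasets` library require an explicit opt-in before they will run the loading script that script-based datasets such as `emotion` ship, which is presumably why the original call started failing.

# Sketch only; assumes a recent `datasets` release and network access to the Hub.
from datasets import load_dataset

# The opt-in flag allows the dataset's own loading script to run.
dataset = load_dataset("emotion", name="split", trust_remote_code=True)

print(dataset)              # DatasetDict with train / validation / test splits
print(dataset["train"][0])  # a single {"text": ..., "label": ...} record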

AI-and-Analytics/Features-and-Functionality/INC_QuantizationAwareTraining_TextClassification/INC_QuantizationAwareTraining_TextClassification.py

Lines changed: 64 additions & 54 deletions
@@ -6,21 +6,21 @@
 
 # =============================================================
 # Copyright © 2023 Intel Corporation
-#
+#
 # SPDX-License-Identifier: MIT
 # =============================================================
 
 
 # # Fine-tuning text classification model with Intel® Neural Compressor (INC) Quantization Aware Training
-#
+#
 # This code sample will show you how to fine tune BERT text model for text multi-class classification task using Quantization Aware Training provided as part of Intel® Neural Compressor (INC).
-#
+#
 # Before we start, please make sure you have installed all necessary libraries to run this code sample.
 
-# ## Loading model
-#
-# We decided to use really small model for this code sample which is `prajjwal1/bert-tiny` but please feel free to use different model changing `model_id` to other name form Hugging Face library or your local model path (if it is compatible with Hugging Face API).
-#
+# ## Loading model
+#
+# We decided to use really small model for this code sample which is `prajjwal1/bert-tiny` but please feel free to use different model changing `model_id` to other name form Hugging Face library or your local model path (if it is compatible with Hugging Face API).
+#
 # Keep in mind that using bigger models like `bert-base-uncased` can improve the final result of the classification after fine-tuning process but it is also really resources and time consuming.
 
 # In[ ]:
@@ -29,42 +29,42 @@
 from transformers import AutoModelForSequenceClassification, AutoTokenizer
 
 model_id = "prajjwal1/bert-tiny"
-model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=6)
+model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=6)
 tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=512)
 
 # The directory where the quantized model will be saved
 save_dir = "quantized_model"
 
 
 # ## Dataset
-#
-# We are using `emotion` [dataset form Hugging Face](https://huggingface.co/datasets/dair-ai/emotion). This dataset has 2 different configurations - **split** and **unsplit**.
-#
+#
+# We are using `emotion` [dataset form Hugging Face](https://huggingface.co/datasets/dair-ai/emotion). This dataset has 2 different configurations - **split** and **unsplit**.
+#
 # In this code sample we are using split configuration. It contains in total 20 000 examples split into train (16 000 texts), test (2 000 texts) and validation (2 000 text) datasets. We decided to use split dataset instead of unsplit configuration as it contains over 400 000 texts which is overkill for fine-tuning.
-#
+#
 # After loading selected dataset we will take a look at first 10 rows of train dataset. You can always change the dataset for different one, just remember to change also number of labels parameter provided when loading the model.
 
 # In[ ]:
 
 
 from datasets import load_dataset
 
-dataset = load_dataset("emotion", name="split")
-dataset['train'][:10]
+dataset = load_dataset("emotion", name="split", trust_remote_code=True)
+dataset["train"][:10]
 
 
 # Dataset contains 6 different labels represented by digits from 0 to 5. Every digit symbolizes different emotion as followed:
-#
+#
 # * 0 - sadness
 # * 1 - joy
 # * 2 - love
 # * 3 - anger
 # * 4 - fear
 # * 5 - surprise
-#
+#
 # In the cell below we conducted few computations on training dataset to better understand how the data looks like. We are analyzing only train dataset as the test and validation datasets have similar data distribution.
-#
-# As you can see, distribution opf classed in dataset is not equal. Having in mind that the train, test and validation distributions are similar this is not a problem for our case.
+#
+# As you can see, distribution opf classed in dataset is not equal. Having in mind that the train, test and validation distributions are similar this is not a problem for our case.
 
 
 # In[ ]:
@@ -73,45 +73,46 @@
 import matplotlib.pyplot as plt
 import pandas as pd
 
-sadness = dataset['train']['label'].count(0)
-joy = dataset['train']['label'].count(1)
-love = dataset['train']['label'].count(2)
-anger = dataset['train']['label'].count(3)
-fear = dataset['train']['label'].count(4)
-surprise = dataset['train']['label'].count(5)
+sadness = dataset["train"]["label"].count(0)
+joy = dataset["train"]["label"].count(1)
+love = dataset["train"]["label"].count(2)
+anger = dataset["train"]["label"].count(3)
+fear = dataset["train"]["label"].count(4)
+surprise = dataset["train"]["label"].count(5)
 
 
 fig = plt.figure()
-ax = fig.add_axes([0,0,1,1])
-labels = ['joy', 'sadness', 'anger', 'fear', 'love', 'surprise']
+ax = fig.add_axes([0, 0, 1, 1])
+labels = ["joy", "sadness", "anger", "fear", "love", "surprise"]
 frames = [joy, sadness, anger, fear, love, surprise]
 ax.bar(labels, frames)
 plt.show()
 
 
 # # Tokenization
-#
-# Next step is to tokenize the dataset.
-#
-# **Tokenization** is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters etc. It means that tokenizer breaks unstructured data (natural language text) into chunks of information that can be considered as discrete elements. The tokens can be used later in a vector representation of that document.
-#
-# In other words tokenization change an text document into a numerical data structure suitable for machine and deep learning.
-#
+#
+# Next step is to tokenize the dataset.
+#
+# **Tokenization** is a way of separating a piece of text into smaller units called tokens. Here, tokens can be either words, characters etc. It means that tokenizer breaks unstructured data (natural language text) into chunks of information that can be considered as discrete elements. The tokens can be used later in a vector representation of that document.
+#
+# In other words tokenization change an text document into a numerical data structure suitable for machine and deep learning.
+#
 # To do that, we created function that takes every text from dataset and tokenize it with maximum token length being 128. After that we can se how the structure of the dataset change.
 
 
 # In[ ]:
 
 
 def tokenize_data(example):
-    return tokenizer(example['text'], padding='max_length', max_length=128)
+    return tokenizer(example["text"], padding="max_length", max_length=128)
+
 
 dataset = dataset.map(tokenize_data, batched=True)
 dataset
 
 
 # Before we start fine-tuning, let's see how the model in current state performs against validation dataset.
-#
+#
 # First, we need to prepare metrics showing model performance. We decided to use accuracy as a performance measure in this specific task. As the model was not created for this specific task, we can assume that the accuracy will not be high.
 
 
@@ -123,6 +124,7 @@ def tokenize_data(example):
 
 metric = evaluate.load("accuracy")
 
+
 def compute_metrics(eval_pred):
     logits, labels = eval_pred
     predictions = np.argmax(logits, axis=-1)
@@ -135,7 +137,7 @@ def compute_metrics(eval_pred):
 # * dataset - in our case this is validation dataset,
 # * metrics - as specified before, in our case accuracy,
 # * label mapping - to map label names with corresponding digits.
-#
+#
 # After the evaluation, we just show the results, which are as expected not the best. At this point model is not prepared for emotion classification task.
 
 
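The `task_evaluator` object used in the next hunk is created in lines this diff does not touch; a hedged guess at that setup, based on the `evaluate` library's task-evaluator API:

# Assumed setup (not shown in this diff) for the compute() call that follows.
import evaluate
import numpy as np  # used by compute_metrics above

metric = evaluate.load("accuracy")
task_evaluator = evaluate.evaluator("text-classification")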
@@ -148,21 +150,28 @@ def compute_metrics(eval_pred):
 
 eval_results = task_evaluator.compute(
     model_or_pipeline=model_id,
-    data=dataset['validation'],
+    data=dataset["validation"],
     metric=metric,
-    label_mapping={"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2, "LABEL_3": 3, "LABEL_4": 4, "LABEL_5": 5}
+    label_mapping={
+        "LABEL_0": 0,
+        "LABEL_1": 1,
+        "LABEL_2": 2,
+        "LABEL_3": 3,
+        "LABEL_4": 4,
+        "LABEL_5": 5,
+    },
 )
 eval_results
 
 
 # # Quantization Aware Training
-#
+#
 # Now, we can move to fine-tuning with quantization. But first, please review the definition of quantization and quantization aware training.
-#
+#
 # **Quantization** is a systematic reduction of the precision of all or several layers within the model. This means, a higher-precision type, such as the single precision floating-point (FP32) is converted into a lower-precision type, such as FP16 (16 bits) or INT8 (8 bits).
-#
+#
 # **Quantization Aware Training** replicates inference-time quantization, resulting in a model that downstream tools may utilize to generate actually quantized models. In other words, it provides quantization to the model during training (or fine-tuning like in our case) based on provided quantization configuration.
-#
+#
 # Having that in mind, we can provide configuration for the Quantization Aware Training form Intel® Neural Compressor.
 
 
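The `quantization_config` object referenced here is also created outside the lines shown in this diff; a minimal sketch of what that setup typically looks like with Intel® Neural Compressor and `optimum.intel` (import paths are assumptions and can vary between releases):

# Assumed, for context: default QAT configuration plus the trainer imports
# that the following hunks rely on; not lines from this commit.
from neural_compressor import QuantizationAwareTrainingConfig
from optimum.intel import INCTrainer
from transformers import TrainingArguments

quantization_config = QuantizationAwareTrainingConfig()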
@@ -177,13 +186,13 @@ def compute_metrics(eval_pred):
 
 # The next step is to create trainer for our model. We will use Intel® Neural Compressor optimize trainer form `optimum.intel` package.
 # We need to provide all necessary parameters to the trainer:
-#
+#
 # * initialized model and tokenizer
 # * configuration for quantization aware training
 # * training arguments that includes: directory where model will be saved, number of epochs
 # * datasets for training and evaluation
 # * prepared metrics that allow us to see the progress in training
-#
+#
 # For purpose of this code sample, we decided to train model by just 2 epochs, to show you how the quantization aware training works and that the fine-tuning really improves the results of the classification. If you wan to receive better accuracy results, yoy can easily incise the number of epochs up to 5 and observe how model learns. Keep in mind that the process may take some time - the more epochs you will use, the training time will be longer.
 
 
@@ -196,7 +205,9 @@ def compute_metrics(eval_pred):
 trainer = INCTrainer(
     model=model,
     quantization_config=quantization_config,
-    args=TrainingArguments(save_dir, num_train_epochs=2.0, do_train=True, do_eval=False),
+    args=TrainingArguments(
+        save_dir, num_train_epochs=2.0, do_train=True, do_eval=False
+    ),
     train_dataset=dataset["train"],
     eval_dataset=dataset["validation"],
     compute_metrics=compute_metrics,
@@ -205,11 +216,11 @@ def compute_metrics(eval_pred):
 
 
 # ## Train the model
-#
+#
 # Now, let's train the model. We will use prepared trainer by executing `train` method on it.
-#
-# You can see, that after the training information about the model are printed under `*****Mixed Precision Statistics*****`.
-#
+#
+# You can see, that after the training information about the model are printed under `*****Mixed Precision Statistics*****`.
+#
 # Now, the model use INT8 instead of FP32 in every layer.
 
 
@@ -220,8 +231,8 @@ def compute_metrics(eval_pred):
 
 
 # ## Evaluate the model
-#
-# After the training we should evaluate our model using `evaluate()` method on prepared trainer. It will show results for prepared before evaluation metrics - evaluation accuracy and loss. Additionally we will have information about evaluation time, samples and steps per second and number of epochs model was trained by.
+#
+# After the training we should evaluate our model using `evaluate()` method on prepared trainer. It will show results for prepared before evaluation metrics - evaluation accuracy and loss. Additionally we will have information about evaluation time, samples and steps per second and number of epochs model was trained by.
 
 
 # In[ ]:
@@ -232,7 +243,7 @@ def compute_metrics(eval_pred):
 
 
 # After the training it is important to save the model. One again we will use prepared trainer and other method - `save_model()`. Our model will be saved in the location provided before.
-# After that, to use this model in the future you just need load it similarly as at the beginning, using dedicated Intel® Neural Compressor optimized method `INCModelForSequenceClassification.from_pretrained(...)`.
+# After that, to use this model in the future you just need load it similarly as at the beginning, using dedicated Intel® Neural Compressor optimized method `INCModelForSequenceClassification.from_pretrained(...)`.
 
 
 # In[ ]:
@@ -243,11 +254,10 @@ def compute_metrics(eval_pred):
 model = INCModelForSequenceClassification.from_pretrained(save_dir)
 
 
-# In this code sample we use BERT-tiny and emotion dataset to create text classification model using Intel® Neural Compressor Quantization Aware Training. We encourage you to experiment with this code sample changing model and datasets to make text models for different classification tasks.
+# In this code sample we use BERT-tiny and emotion dataset to create text classification model using Intel® Neural Compressor Quantization Aware Training. We encourage you to experiment with this code sample changing model and datasets to make text models for different classification tasks.
 
 
 # In[ ]:
 
 
 print("[CODE_SAMPLE_COMPLETED_SUCCESFULLY]")
-
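To close the loop on the sample this commit touches, a small inference sketch (not part of the commit) with the re-loaded quantized model; the label order follows the emotion mapping listed earlier, and `save_dir` is the directory used above:

# Sketch only; assumes the sample above has been run so `save_dir` holds the model.
import torch
from optimum.intel import INCModelForSequenceClassification
from transformers import AutoTokenizer

save_dir = "quantized_model"
tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny", model_max_length=512)
model = INCModelForSequenceClassification.from_pretrained(save_dir)

labels = ["sadness", "joy", "love", "anger", "fear", "surprise"]  # 0..5, as in the sample
inputs = tokenizer("i am feeling great today", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[int(logits.argmax(dim=-1))])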