---
title: Get started
---

<h1>Train and deploy Hugging Face on Amazon SageMaker</h1>

This get started guide shows you how to quickly use Hugging Face on Amazon SageMaker. Learn how to fine-tune and deploy a pretrained 🤗 Transformers model on SageMaker for a binary text classification task.

💡 If you are new to Hugging Face, we recommend first reading the 🤗 Transformers [quick tour](https://huggingface.co/transformers/quicktour.html).

<iframe width="560" height="315" src="https://www.youtube.com/embed/pYqjCzoyWyo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

📓 Open the [notebook](https://github.com/huggingface/notebooks/blob/master/sagemaker/01_getting_started_pytorch/sagemaker-notebook.ipynb) to follow along!

## Installation and setup

Get started by installing the necessary Hugging Face libraries and the SageMaker Python SDK. You will also need to install [PyTorch](https://pytorch.org/get-started/locally/) and [TensorFlow](https://www.tensorflow.org/install/pip#tensorflow-2-packages-are-available) if you don't already have them installed.

```python
pip install "sagemaker>=2.48.0" "transformers==4.6.1" "datasets[s3]==1.6.2" --upgrade
```

If you want to run this example in [SageMaker Studio](https://docs.aws.amazon.com/sagemaker/latest/dg/studio.html), upgrade [ipywidgets](https://ipywidgets.readthedocs.io/en/latest/) for the 🤗 Datasets library and restart the kernel:

```python
%%capture
import IPython
!conda install -c conda-forge ipywidgets -y
IPython.Application.instance().kernel.do_shutdown(True)
```

Next, you should set up your environment: a SageMaker session and an S3 bucket. The S3 bucket will store data, models, and logs. You will need access to an [IAM execution role](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html) with the required permissions.

If you are planning on using SageMaker in a local environment, you need to provide the `role` yourself. Learn more about how to set this up [here](https://huggingface.co/docs/sagemaker/train#installation-and-setup).

⚠️ The execution role is only available when you run a notebook within SageMaker. If you try to run `get_execution_role` in a notebook not on SageMaker, you will get a region error.

```python
import sagemaker

sess = sagemaker.Session()
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    sagemaker_session_bucket = sess.default_bucket()

role = sagemaker.get_execution_role()
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
```
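
If you are working from a local environment instead of a SageMaker notebook, a minimal sketch like the following can retrieve the role ARN with boto3. This assumes you have already created a SageMaker execution role in IAM (named `sagemaker_execution_role` here as an example) and configured your AWS credentials:

```python
import boto3
import sagemaker

# look up an existing execution role by name when running outside of SageMaker
# (the role name below is an example; use the role you created in IAM)
iam = boto3.client("iam")
role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session()
```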

## Preprocess

The 🤗 Datasets library makes it easy to download and preprocess a dataset for training. Download and tokenize the [IMDb](https://huggingface.co/datasets/imdb) dataset:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# load dataset
train_dataset, test_dataset = load_dataset("imdb", split=["train", "test"])

# load tokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

# create tokenization function
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

# tokenize train and test datasets
train_dataset = train_dataset.map(tokenize, batched=True)
test_dataset = test_dataset.map(tokenize, batched=True)

# set dataset format for PyTorch
train_dataset = train_dataset.rename_column("label", "labels")
train_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
test_dataset = test_dataset.rename_column("label", "labels")
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
```

## Upload dataset to S3 bucket

Next, upload the preprocessed dataset to your S3 session bucket with the 🤗 Datasets S3 [filesystem](https://huggingface.co/docs/datasets/filesystems.html) implementation:

```python
import botocore
from datasets.filesystems import S3FileSystem

s3_prefix = 'samples/datasets/imdb'
s3 = S3FileSystem()

# save train_dataset to S3
training_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/train'
train_dataset.save_to_disk(training_input_path, fs=s3)

# save test_dataset to S3
test_input_path = f's3://{sess.default_bucket()}/{s3_prefix}/test'
test_dataset.save_to_disk(test_input_path, fs=s3)
```
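
Optionally, you can verify the upload by loading one of the datasets back from S3. This is only a quick sanity check and is not required for training:

```python
from datasets import load_from_disk

# reload the tokenized training dataset directly from S3
reloaded_train_dataset = load_from_disk(training_input_path, fs=s3)
print(reloaded_train_dataset)
```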

## Start a training job

Create a Hugging Face Estimator to handle end-to-end SageMaker training and deployment. The most important parameters to pay attention to are:

* `entry_point` refers to the fine-tuning script which you can find [here](https://github.com/huggingface/notebooks/blob/master/sagemaker/01_getting_started_pytorch/scripts/train.py). A simplified sketch of this script is shown after the training step below.
* `instance_type` refers to the SageMaker instance that will be launched. Take a look [here](https://aws.amazon.com/sagemaker/pricing/) for a complete list of instance types.
* `hyperparameters` refers to the training hyperparameters the model will be fine-tuned with.

```python
from sagemaker.huggingface import HuggingFace

hyperparameters = {
    "epochs": 1,                             # number of training epochs
    "train_batch_size": 32,                  # training batch size
    "model_name": "distilbert-base-uncased"  # name of pretrained model
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",            # fine-tuning script to use in training job
    source_dir="./scripts",            # directory where fine-tuning script is stored
    instance_type="ml.p3.2xlarge",     # instance type
    instance_count=1,                  # number of instances
    role=role,                         # IAM role used in training job to access AWS resources (S3)
    transformers_version="4.6",        # Transformers version
    pytorch_version="1.7",             # PyTorch version
    py_version="py36",                 # Python version
    hyperparameters=hyperparameters    # hyperparameters to use in training job
)
```

Begin training with one line of code:

```python
huggingface_estimator.fit({"train": training_input_path, "test": test_input_path})
```
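
When `fit()` runs, SageMaker launches the training container, passes the hyperparameters to `train.py` as command-line arguments, and exposes the data channels and output location through environment variables such as `SM_CHANNEL_TRAIN`, `SM_CHANNEL_TEST`, and `SM_MODEL_DIR`. The full script lives in the repository linked above; the following is only a simplified sketch of its structure:

```python
# train.py -- simplified sketch of the fine-tuning script; see the linked repository for the full version
import argparse
import os

from datasets import load_from_disk
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    # hyperparameters are passed to the script as command-line arguments
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--train_batch_size", type=int, default=32)
    parser.add_argument("--model_name", type=str)
    # SageMaker exposes the input channels and output path as environment variables
    parser.add_argument("--train_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    args, _ = parser.parse_known_args()

    # load the datasets that were uploaded to S3 (SageMaker copies them into the container)
    train_dataset = load_from_disk(args.train_dir)
    test_dataset = load_from_disk(args.test_dir)

    # load the pretrained model and tokenizer
    model = AutoModelForSequenceClassification.from_pretrained(args.model_name)
    tokenizer = AutoTokenizer.from_pretrained(args.model_name)

    # fine-tune with the Trainer API
    training_args = TrainingArguments(
        output_dir=args.model_dir,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=test_dataset,
        tokenizer=tokenizer,
    )
    trainer.train()

    # save the model so SageMaker uploads it to S3 and can deploy it later
    trainer.save_model(args.model_dir)
```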

## Deploy model

Once the training job is complete, deploy your fine-tuned model by calling `deploy()` with the number of instances and instance type:

```python
predictor = huggingface_estimator.deploy(initial_instance_count=1, instance_type="ml.g4dn.xlarge")
```

Call `predict()` on your data:

```python
sentiment_input = {"inputs": "It feels like a curtain closing...there was an elegance in the way they moved toward conclusion. No fan is going to watch and feel short-changed."}

predictor.predict(sentiment_input)
```
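
The endpoint runs the model as a text classification pipeline, so the response is a list containing the predicted label and its score. The label name and score below are illustrative only; the exact values depend on your trained model:

```python
# capture and inspect the prediction result
result = predictor.predict(sentiment_input)
print(result)
# illustrative output format: [{"label": "LABEL_1", "score": 0.98}]
```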

After running your request, delete the endpoint:

```python
predictor.delete_endpoint()
```

## What's next?

Congratulations, you've just fine-tuned and deployed a pretrained 🤗 Transformers model on SageMaker! 🎉

For your next steps, keep reading our documentation for more details about training and deployment. There are many interesting features such as [distributed training](/docs/sagemaker/train#distributed-training) and [Spot instances](/docs/sagemaker/train#spot-instances).