# Searching the Hub Efficiently with Python

In this tutorial, we will explore how to interact with the Hugging Face Hub using the `huggingface_hub` library, so you can quickly find the models and datasets available on it.

## The Basics

`huggingface_hub` is a Python library that allows anyone to freely extract useful information from the Hub, as well as download and publish models. You can install it with:

```bash
pip install huggingface_hub
```

It comes packaged with the `HfApi` class, an interface for interacting with the Hub:

```python
>>> from huggingface_hub import HfApi
>>> api = HfApi()
```

This class lets you perform a variety of operations that interact with the raw Hub API. We'll be focusing on two specific functions (a minimal call of each is sketched right after this list):
- `list_models`
- `list_datasets`
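
Calling either function with no arguments returns everything on the Hub, which is rarely what you want, but it shows the basic shape of the API. A minimal sketch (result counts will change as the Hub grows):

```python
>>> all_models = api.list_models()      # every model currently on the Hub
>>> all_datasets = api.list_datasets()  # every dataset currently on the Hub
>>> all_models[0]                       # each entry is a ModelInfo object
```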

If you look at what can be passed into each function, you will find the parameter list looks something like:
- `filter`
- `author`
- `search`
- ...

Two of these parameters are intuitive (`author` and `search`), but what about that `filter`? 🤔 Let's dive into a few helpers quickly and revisit that question.
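
For the intuitive two, here is a quick sketch of how they are used; the query strings below are only examples, and the results will vary over time:

```python
>>> # Models whose names match a free-text query
>>> api.list_models(search="bert")[0]

>>> # Models published under a specific user or organization
>>> api.list_models(author="CAMeL-Lab")[0]
```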

## Search Parameters

The `huggingface_hub` library provides a user-friendly interface for discovering exactly what can be passed to the `filter` parameter, through the `ModelSearchArguments` and `DatasetSearchArguments` classes:

```python
>>> from huggingface_hub import ModelSearchArguments, DatasetSearchArguments

>>> model_args = ModelSearchArguments()
>>> dataset_args = DatasetSearchArguments()
```

These are nested namespace objects that contain **every single option** available on the Hub and return the exact values that should be passed to `filter`. Best of all, they have tab completion 🎊.

## Searching for a Model

Let's pose a problem that would be complicated to solve without access to this information:
> I want to search the Hub for all PyTorch models trained on the `glue` dataset that can do Text Classification.

If you look at what is available in `model_args` by inspecting its output, you will find:

```python
>>> model_args
```

    Available Attributes or Keys:
     * author
     * dataset
     * language
     * library
     * license
     * model_name
     * pipeline_tag

It has a variety of attributes or keys available to you. This is because it is both an object and a dictionary, so you can either do `model_args["author"]` or `model_args.author`. For this tutorial, let's follow the latter format.
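
As a quick check that the two access styles really do line up, here is a small sketch; it assumes dictionary-style access works at every level of nesting, not just at the top:

```python
>>> # Attribute access and dictionary access resolve to the same value
>>> model_args["library"]["PyTorch"] == model_args.library.PyTorch
True
```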

The first criterion is getting all PyTorch models. This would be found under the `library` attribute, so let's see if it is there:

```python
>>> model_args.library
```

    Available Attributes or Keys:
     * AdapterTransformers
     * Asteroid
     * ESPnet
     * Flair
     * JAX
     * Joblib
     * Keras
     * ONNX
     * PyTorch
     * Pyannote
     * Rust
     * Scikit_learn
     * SentenceTransformers
     * Stanza
     * TFLite
     * TensorBoard
     * TensorFlow
     * TensorFlowTTS
     * Timm
     * Transformers
     * allennlp
     * fastText
     * fastai
     * spaCy
     * speechbrain

It is! The `PyTorch` name is there, so you'll need to use `model_args.library.PyTorch`:

```python
>>> model_args.library.PyTorch
```

    'pytorch'

The same process can be repeated for the other two requirements: `Text Classification` lives under the `pipeline_tag` attribute, and `glue` under the `dataset` attribute.
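
A short sketch of what those two lookups look like (the attribute names are the same ones used in the filter built just below; each returns the tag string the Hub expects):

```python
>>> model_args.pipeline_tag.TextClassification   # value for the Text Classification task
>>> model_args.dataset.glue                      # value for the `glue` dataset
```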

Now that all the pieces are in place, the last step is to combine them into something the API can use through the `ModelFilter` and `DatasetFilter` classes. These classes turn the values you just looked up into a single filter the API understands:

```python
>>> from huggingface_hub import ModelFilter, DatasetFilter

>>> filt = ModelFilter(
...     task=model_args.pipeline_tag.TextClassification,
...     trained_dataset=model_args.dataset.glue,
...     library=model_args.library.PyTorch
... )
>>> api.list_models(filter=filt)[0]
```

    ModelInfo: {
        modelId: 09panesara/distilbert-base-uncased-finetuned-cola
        sha: f89a85cb8703676115912fffa55842f23eb981ab
        lastModified: 2021-12-21T14:03:01.000Z
        tags: ['pytorch', 'tensorboard', 'distilbert', 'text-classification', 'dataset:glue', 'transformers', 'license:apache-2.0', 'generated_from_trainer', 'model-index', 'infinity_compatible']
        pipeline_tag: text-classification
        siblings: [ModelFile(rfilename='.gitattributes'), ModelFile(rfilename='.gitignore'), ModelFile(rfilename='README.md'), ModelFile(rfilename='config.json'), ModelFile(rfilename='pytorch_model.bin'), ModelFile(rfilename='special_tokens_map.json'), ModelFile(rfilename='tokenizer.json'), ModelFile(rfilename='tokenizer_config.json'), ModelFile(rfilename='training_args.bin'), ModelFile(rfilename='vocab.txt'), ModelFile(rfilename='runs/Dec21_13-51-40_bc62d5d57d92/events.out.tfevents.1640094759.bc62d5d57d92.77.0'), ModelFile(rfilename='runs/Dec21_13-51-40_bc62d5d57d92/events.out.tfevents.1640095117.bc62d5d57d92.77.2'), ModelFile(rfilename='runs/Dec21_13-51-40_bc62d5d57d92/1640094759.4067502/events.out.tfevents.1640094759.bc62d5d57d92.77.1')]
        config: None
        private: False
        downloads: 6
        library_name: transformers
        likes: 0
    }

As you can see, it found the models that fit all of the criteria. You can take it even further by passing a list for any of these parameters. For example, let's look for the same configuration, but also include `TensorFlow` in the filter:

```python
>>> filt = ModelFilter(
...     task=model_args.pipeline_tag.TextClassification,
...     library=[model_args.library.PyTorch, model_args.library.TensorFlow]
... )
>>> api.list_models(filter=filt)[0]
```

    ModelInfo: {
        modelId: CAMeL-Lab/bert-base-arabic-camelbert-ca-poetry
        sha: bc50b6dc1c97dc66998287efb6d044bdaa8f7057
        lastModified: 2021-10-17T12:09:38.000Z
        tags: ['pytorch', 'tf', 'bert', 'text-classification', 'ar', 'arxiv:1905.05700', 'arxiv:2103.06678', 'transformers', 'license:apache-2.0', 'infinity_compatible']
        pipeline_tag: text-classification
        siblings: [ModelFile(rfilename='.gitattributes'), ModelFile(rfilename='README.md'), ModelFile(rfilename='config.json'), ModelFile(rfilename='pytorch_model.bin'), ModelFile(rfilename='special_tokens_map.json'), ModelFile(rfilename='tf_model.h5'), ModelFile(rfilename='tokenizer_config.json'), ModelFile(rfilename='training_args.bin'), ModelFile(rfilename='vocab.txt')]
        config: None
        private: False
        downloads: 21
        library_name: transformers
        likes: 0
    }

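`api.list_models` returns a plain list of `ModelInfo` objects (which is why the `[0]` indexing above works), so you can also inspect the results in bulk. A small sketch; the counts and IDs will change as the Hub grows:

```python
>>> models = api.list_models(filter=filt)
>>> len(models)                       # how many models matched the filter
>>> [m.modelId for m in models[:3]]   # peek at the first few matching model IDs
```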

## Searching for a Dataset

Similar to finding a model, you can find a dataset easily by following the same steps.

The new scenario will be:
> I want to search the Hub for all datasets that can be used for `text_classification` and are in English.

First, look at what is available in `DatasetSearchArguments`, just as you did with `ModelSearchArguments`:

```python
>>> dataset_args = DatasetSearchArguments()
>>> dataset_args
```

    Available Attributes or Keys:
     * author
     * benchmark
     * dataset_name
     * language_creators
     * languages
     * licenses
     * multilinguality
     * size_categories
     * task_categories
     * task_ids

`text_classification` is a *task*, so first you should check `task_categories`:

```python
>>> dataset_args.task_categories
```

    Available Attributes or Keys:
     * Summarization
     * audio_classification
     * automatic_speech_recognition
     * code_generation
     * conditional_text_generation
     * cross_language_transcription
     * dialogue_system
     * grammaticalerrorcorrection
     * machine_translation
     * named_entity_disambiguation
     * named_entity_recognition
     * natural_language_inference
     * news_classification
     * other
     * other_test
     * other_text_search
     * paraphrase
     * paraphrasedetection
     * query_paraphrasing
     * question_answering
     * question_generation
     * sentiment_analysis
     * sequence_modeling
     * speech_processing
     * structure_prediction
     * summarization
     * text_classification
     * text_generation
     * text_retrieval
     * text_scoring
     * textual_entailment
     * translation

There you will find `text_classification`, so you should use `dataset_args.task_categories.text_classification`.
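
You can also confirm the key programmatically; a tiny sketch, assuming membership checks work the same way as the language check shown next:

```python
>>> "text_classification" in dataset_args.task_categories
True
```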

Next, you need the proper language. Languages are stored under the `languages` attribute as two-letter language codes, so you should check whether it has `en`:

```python
>>> "en" in dataset_args.languages
```

    True

Now that the pieces are found, you can write a filter:

```python
>>> filt = DatasetFilter(
...     languages=dataset_args.languages.en,
...     task_categories=dataset_args.task_categories.text_classification
... )
```

And search the API!

```python
>>> api.list_datasets(filter=filt)[0]
```

    DatasetInfo: {
        id: Abirate/english_quotes
        lastModified: None
        tags: ['annotations_creators:expert-generated', 'language_creators:expert-generated', 'language_creators:crowdsourced', 'languages:en', 'multilinguality:monolingual', 'source_datasets:original', 'task_categories:text-classification', 'task_ids:multi-label-classification']
        private: False
        author: Abirate
        description: None
        citation: None
        cardData: None
        siblings: None
        gated: False
    }

With these two utilities combined, you can easily search the Hub for both models and datasets using any of the parameters and tags it makes available!
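
Putting it all together, here is an end-to-end sketch of both searches from this tutorial, using only the classes and values introduced above:

```python
>>> from huggingface_hub import (
...     HfApi,
...     ModelSearchArguments, DatasetSearchArguments,
...     ModelFilter, DatasetFilter,
... )

>>> api = HfApi()
>>> model_args = ModelSearchArguments()
>>> dataset_args = DatasetSearchArguments()

>>> # PyTorch Text Classification models trained on glue
>>> model_filt = ModelFilter(
...     task=model_args.pipeline_tag.TextClassification,
...     trained_dataset=model_args.dataset.glue,
...     library=model_args.library.PyTorch,
... )
>>> pytorch_glue_models = api.list_models(filter=model_filt)

>>> # English text-classification datasets
>>> dataset_filt = DatasetFilter(
...     languages=dataset_args.languages.en,
...     task_categories=dataset_args.task_categories.text_classification,
... )
>>> english_text_classification_datasets = api.list_datasets(filter=dataset_filt)
```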