
Commit e24f969

Add a HowTo on searching the Hub efficiently (#576)

* Update python-tests.yml
* Update and rename python-tests.yml to python-tests-windows.yml
* Fix typo in example
* Workflow
* Keep workflow same
* Write searching tutorial
* Search -> Explore + Hugging Face Hub
* Just call HFHub "Hub" after the first time
* First -> Third person
* Intuitive + fun emoji
* Make it sound exciting
* Graphic -> Animation + Third Person
* Order of thinking
* We -> you + clarity
* Update docs/hub/searching-the-hub.md
* Alt text
* Install directions + wording
* Add datasets section

Co-authored-by: Omar Sanseviero <[email protected]>
1 parent 683d564 commit e24f969

File tree

3 files changed: +327 -0 lines changed

docs/assets/hub/search_glue.gif (37 KB)

docs/assets/hub/search_text_classification.gif (83.9 KB)

docs/hub/searching-the-hub.md

Lines changed: 327 additions & 0 deletions
@@ -0,0 +1,327 @@

# Searching the Hub Efficiently with Python

In this tutorial, you will explore how to interact with the Hugging Face Hub through the `huggingface_hub` library to quickly find available models and datasets.

## The Basics

`huggingface_hub` is a Python library that allows anyone to freely extract useful information from the Hub, as well as download and publish models. You can install it with:

```bash
pip install huggingface_hub
```

It comes packaged with the `HfApi` class, an interface for interacting with the Hub:

```python
>>> from huggingface_hub import HfApi
>>> api = HfApi()
```

This class lets you perform a variety of operations that interact with the raw Hub API. We'll be focusing on two specific functions (sketched briefly below):

- `list_models`
- `list_datasets`
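
Called with no arguments, both functions return a list of information objects for everything hosted on the Hub. A minimal sketch (the exact counts will vary over time):

```python
>>> # Fetch info objects for every model and dataset on the Hub
>>> models = api.list_models()
>>> datasets = api.list_datasets()
>>> # Each entry is a ModelInfo or DatasetInfo object
>>> type(models[0]), type(datasets[0])
```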

If you look at what can be passed into each function, you will find the parameter list looks something like:

- `filter`
- `author`
- `search`
- ...

Two of these parameters are intuitive (`author` and `search`), and you can pass them directly, as the sketch below shows. But what about that `filter`? 🤔 Let's dive into a few helpers quickly and revisit that question.
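
Here is a minimal sketch of the intuitive parameters (both values are purely illustrative):

```python
>>> # All models from the `google` author whose name matches "bert"
>>> api.list_models(search="bert", author="google")
```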

## Search Parameters

The `huggingface_hub` library provides a user-friendly interface for discovering exactly what can be passed into this `filter` parameter, through the `ModelSearchArguments` and `DatasetSearchArguments` classes:

```python
>>> from huggingface_hub import ModelSearchArguments, DatasetSearchArguments

>>> model_args = ModelSearchArguments()
>>> dataset_args = DatasetSearchArguments()
```

These are nested namespace objects that contain **every single option** available on the Hub and that will return what should be passed to `filter`. Best of all, they have tab completion 🎊.

## Searching for a Model

Let's pose a problem that would be complicated to solve without access to this information:

> I want to search the Hub for all PyTorch models trained on the `glue` dataset that can do Text Classification.

If you inspect the output of `model_args`, you will find:

```python
>>> model_args
```

Available Attributes or Keys:
 * author
 * dataset
 * language
 * library
 * license
 * model_name
 * pipeline_tag

It has a variety of attributes or keys available to you. This is because it is both an object and a dictionary, so you can either do `model_args["author"]` or `model_args.author`. For this tutorial, let's follow the latter format.
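
As a quick sketch, both access styles reach the same nested namespace:

```python
>>> # Dictionary-style access
>>> model_args["author"]
>>> # Attribute-style access, equivalent to the line above
>>> model_args.author
```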

The first criterion is getting all PyTorch models. This would be found under the `library` attribute, so let's see if it is there:

```python
>>> model_args.library
```

Available Attributes or Keys:
 * AdapterTransformers
 * Asteroid
 * ESPnet
 * Flair
 * JAX
 * Joblib
 * Keras
 * ONNX
 * PyTorch
 * Pyannote
 * Rust
 * Scikit_learn
 * SentenceTransformers
 * Stanza
 * TFLite
 * TensorBoard
 * TensorFlow
 * TensorFlowTTS
 * Timm
 * Transformers
 * allennlp
 * fastText
 * fastai
 * spaCy
 * speechbrain

It is! The `PyTorch` name is there, so you'll need to use `model_args.library.PyTorch`:

```python
>>> model_args.library.PyTorch
```

'pytorch'
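
Notice that the attribute resolves to the plain string tag the Hub uses. That means (as a minimal sketch) passing the literal string should behave the same way:

```python
>>> # Equivalent to passing model_args.library.PyTorch
>>> api.list_models(filter="pytorch")
```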

Below is an animation repeating the process for finding both the `Text Classification` and `glue` requirements:

![Animation exploring `model_args.pipeline_tag`](../assets/hub/search_text_classification.gif)

![Animation exploring `model_args.dataset`](../assets/hub/search_glue.gif)

Now that all the pieces are there, the last step is to combine them into something the API can use through the `ModelFilter` and `DatasetFilter` classes. These classes transform the outputs of the previous steps into something the API can use conveniently:

```python
>>> from huggingface_hub import ModelFilter, DatasetFilter

>>> filt = ModelFilter(
...     task=model_args.pipeline_tag.TextClassification,
...     trained_dataset=model_args.dataset.glue,
...     library=model_args.library.PyTorch
... )
>>> api.list_models(filter=filt)[0]
```

ModelInfo: {
modelId: 09panesara/distilbert-base-uncased-finetuned-cola
sha: f89a85cb8703676115912fffa55842f23eb981ab
lastModified: 2021-12-21T14:03:01.000Z
tags: ['pytorch', 'tensorboard', 'distilbert', 'text-classification', 'dataset:glue', 'transformers', 'license:apache-2.0', 'generated_from_trainer', 'model-index', 'infinity_compatible']
pipeline_tag: text-classification
siblings: [ModelFile(rfilename='.gitattributes'), ModelFile(rfilename='.gitignore'), ModelFile(rfilename='README.md'), ModelFile(rfilename='config.json'), ModelFile(rfilename='pytorch_model.bin'), ModelFile(rfilename='special_tokens_map.json'), ModelFile(rfilename='tokenizer.json'), ModelFile(rfilename='tokenizer_config.json'), ModelFile(rfilename='training_args.bin'), ModelFile(rfilename='vocab.txt'), ModelFile(rfilename='runs/Dec21_13-51-40_bc62d5d57d92/events.out.tfevents.1640094759.bc62d5d57d92.77.0'), ModelFile(rfilename='runs/Dec21_13-51-40_bc62d5d57d92/events.out.tfevents.1640095117.bc62d5d57d92.77.2'), ModelFile(rfilename='runs/Dec21_13-51-40_bc62d5d57d92/1640094759.4067502/events.out.tfevents.1640094759.bc62d5d57d92.77.1')]
config: None
private: False
downloads: 6
library_name: transformers
likes: 0
}

As you can see, it found the models that fit all the criteria. You can even take it further by passing in an array for each of the parameters from before. For example, let's search for the same configuration, but also include `TensorFlow` in the filter:

```python
>>> filt = ModelFilter(
...     task=model_args.pipeline_tag.TextClassification,
...     library=[model_args.library.PyTorch, model_args.library.TensorFlow]
... )
>>> api.list_models(filter=filt)[0]
```

ModelInfo: {
modelId: CAMeL-Lab/bert-base-arabic-camelbert-ca-poetry
sha: bc50b6dc1c97dc66998287efb6d044bdaa8f7057
lastModified: 2021-10-17T12:09:38.000Z
tags: ['pytorch', 'tf', 'bert', 'text-classification', 'ar', 'arxiv:1905.05700', 'arxiv:2103.06678', 'transformers', 'license:apache-2.0', 'infinity_compatible']
pipeline_tag: text-classification
siblings: [ModelFile(rfilename='.gitattributes'), ModelFile(rfilename='README.md'), ModelFile(rfilename='config.json'), ModelFile(rfilename='pytorch_model.bin'), ModelFile(rfilename='special_tokens_map.json'), ModelFile(rfilename='tf_model.h5'), ModelFile(rfilename='tokenizer_config.json'), ModelFile(rfilename='training_args.bin'), ModelFile(rfilename='vocab.txt')]
config: None
private: False
downloads: 21
library_name: transformers
likes: 0
}
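
The examples above only peek at the first result, but the full return value is a plain list you can iterate over. A minimal sketch (the attribute names match the `ModelInfo` fields shown above):

```python
>>> models = api.list_models(filter=filt)
>>> # Number of matches, and a peek at the first few model ids
>>> len(models)
>>> [model.modelId for model in models[:3]]
```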

## Searching for a Dataset

Just as with finding a model, you can easily find a dataset by following the same steps.

The new scenario will be:

> I want to search the Hub for all datasets that can be used for `text_classification` and are in English.

First, look at what is available in `DatasetSearchArguments`, just as you did with `ModelSearchArguments`:

```python
>>> dataset_args = DatasetSearchArguments()
>>> dataset_args
```

Available Attributes or Keys:
 * author
 * benchmark
 * dataset_name
 * language_creators
 * languages
 * licenses
 * multilinguality
 * size_categories
 * task_categories
 * task_ids

`text_classification` is a *task*, so first you should check `task_categories`:

```python
>>> dataset_args.task_categories
```

Available Attributes or Keys:
 * Summarization
 * audio_classification
 * automatic_speech_recognition
 * code_generation
 * conditional_text_generation
 * cross_language_transcription
 * dialogue_system
 * grammaticalerrorcorrection
 * machine_translation
 * named_entity_disambiguation
 * named_entity_recognition
 * natural_language_inference
 * news_classification
 * other
 * other_test
 * other_text_search
 * paraphrase
 * paraphrasedetection
 * query_paraphrasing
 * question_answering
 * question_generation
 * sentiment_analysis
 * sequence_modeling
 * speech_processing
 * structure_prediction
 * summarization
 * text_classification
 * text_generation
 * text_retrieval
 * text_scoring
 * textual_entailment
 * translation

There you will find `text_classification`, so you should use `dataset_args.task_categories.text_classification`.

Next, you need to find the proper language. There is a `languages` property you can check. The languages are two-letter codes, so check whether `en` is there:

```python
>>> "en" in dataset_args.languages
```

True
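
Since it exists, you can grab it with the same attribute access as before. A quick sketch (the attribute resolves to the plain `en` tag):

```python
>>> # Resolves to the two-letter code the API expects
>>> dataset_args.languages.en
```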

Now that the pieces are found, you can write a filter:

```python
>>> filt = DatasetFilter(
...     languages=dataset_args.languages.en,
...     task_categories=dataset_args.task_categories.text_classification
... )
```

And search the API!

```python
>>> api.list_datasets(filter=filt)[0]
```

DatasetInfo: {
id: Abirate/english_quotes
lastModified: None
tags: ['annotations_creators:expert-generated', 'language_creators:expert-generated', 'language_creators:crowdsourced', 'languages:en', 'multilinguality:monolingual', 'source_datasets:original', 'task_categories:text-classification', 'task_ids:multi-label-classification']
private: False
author: Abirate
description: None
citation: None
cardData: None
siblings: None
gated: False
}

With these two functionalities combined, you can search through all of the available parameters and tags on the Hub with ease, for both models and datasets!
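
As one final sketch (a reasonable pattern rather than part of the tutorial itself), the structured `filter` parameter also composes with the free-text `search` parameter from the beginning of this guide:

```python
>>> # Combine a structured filter with a free-text search term
>>> filt = ModelFilter(library=model_args.library.PyTorch)
>>> api.list_models(filter=filt, search="bert")
```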
