Commit ca2c1e0

Merge branch 'master' into release_0.15.0
2 parents fe7a737 + 0becfed

14 files changed: +375 −55 lines

docs/conf.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -5,6 +5,7 @@
 
 # -- Project information -----------------------------------------------------
 from sphinx_github_style import get_linkcode_resolve
+from torch.nn import Module
 
 version = "0.15.0"
 release = "0.15.0"
@@ -100,7 +101,7 @@ def linkcode_resolve(*args):
 html_show_sphinx = False
 
 # Napoleon settings
-napoleon_include_init_with_doc = True
+napoleon_include_init_with_doc = False
 napoleon_include_private_with_doc = False
 
 autodoc_default_options = {
```

docs/tutorial/tutorial-training/how-model-training-works.md

Lines changed: 3 additions & 9 deletions

```diff
@@ -279,16 +279,10 @@ print(sentence.to_tagged_string())
 
 If the model works well, it will correctly tag 'love' as a verb in this example.
 
-## Summary
+## Next
 
-This tutorial gave you a general overview of the main steps to train a model:
+Congrats, you now have a general overview of the main steps to train a model in Flair!
 
-- load a corpus
-- choose a label type
-- create a label dictionary
-- choose embeddings
-- initialize model
-- initialize trainer
-- train
+Next, learn about the [two main training approaches in Flair](train-vs-fine-tune.md).
 
 
```

docs/tutorial/tutorial-training/how-to-load-custom-dataset.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -159,3 +159,6 @@ example we chose `label_type='topic'` to denote that we are loading a corpus wit
 
 
 
+## Next
+
+Next, learn [how to train a sequence tagger](how-to-train-sequence-tagger.md).
```

docs/tutorial/tutorial-training/how-to-load-prepared-dataset.md

Lines changed: 4 additions & 0 deletions

```diff
@@ -193,3 +193,7 @@ The following datasets are supported:
 | Universal Dependency Treebanks | [flair.datasets.treebanks](#flair.datasets.treebanks) |
 | OCR-Layout-NER | [flair.datasets.ocr](#flair.datasets.ocr) |
 
+
+## Next
+
+Next, learn how to load a [custom dataset](how-to-load-custom-dataset.md).
```

docs/tutorial/tutorial-training/how-to-train-sequence-tagger.md

Lines changed: 3 additions & 0 deletions

```diff
@@ -223,3 +223,6 @@ trainer.train('resources/taggers/example-universal-pos',
 This gives you a multilingual model. Try experimenting with more languages!
 
 
+## Next
+
+Next, learn [how to train a text classifier](how-to-train-text-classifier.md).
```

docs/tutorial/tutorial-training/how-to-train-text-classifier.md

Lines changed: 4 additions & 0 deletions

````diff
@@ -58,3 +58,7 @@ classifier.predict(sentence)
 print(sentence.labels)
 ```
 
+
+## Next
+
+Next, learn [how to train an entity linker](how-to-train-span-classifier.md).
````

docs/tutorial/tutorial-training/train-vs-fine-tune.md

Lines changed: 40 additions & 1 deletion

````diff
@@ -1,11 +1,50 @@
 # Training vs fine-tuning
 
 There are two broad ways you train a model: The "classic" approach and the fine-tuning approach. This section
-explains the differences, and the things you need to do.
+explains the differences.
 
 
 ## Fine-Tuning
 
+Fine-tuning is the current state-of-the-art approach. The main idea is that you take a pre-trained language model that
+consists of (hundreds of) millions of trained parameters. To this language model you add a simple prediction head with
+randomly initialized weights.
+
+Since in this case, the vast majority of parameters in the model is already trained, you only need to "fine-tune" this
+model. This means: Very small learning rate (LR) and just a few epochs. You are essentially just minimally modifying
+the model to adapt it to the task you want to solve.
+
+Use this method by calling [`ModelTrainer.fine_tune()`](#flair.trainers.ModelTrainer.fine_tune).
+Since most models in Flair were trained this way, this is likely the approach you'll want to use.
 
 ## Training
 
+On the other hand, you should use the classic training approach if the majority of the trainable parameters in your
+model is randomly initialized. This can happen for instance if you freeze the model weights of the pre-trained language
+model, leaving only the randomly initialized prediction head as trainable parameters. This training approach is also
+referred to as "feature-based" or "probing" in some papers.
+
+Since the majority of parameters is randomly initialized, you need to fully train the model. This means: high learning
+rate and many epochs.
+
+Use this method by calling [`ModelTrainer.train()`](#flair.trainers.ModelTrainer.train).
+
+```{note}
+Another application of classic training is for linear probing of pre-trained language models. In this scenario, you
+"freeze" the weights of the language model (meaning that they cannot be changed) and add a prediction head that is
+trained from scratch. So, even though a language model is involved, its parameters are not trainable. This means that
+all trainable parameters in this scenario are randomly initialized, therefore necessitating the use of the classic
+training approach.
+```
+
+
+## Paper
+
+If you are interested in an experimental comparison of the two above-mentioned approaches, check out [our paper](https://arxiv.org/pdf/2011.06993)
+that compares fine-tuning to the feature-based approach.
+
+
+## Next
+
+Next, learn how to load a [training dataset](how-to-load-prepared-dataset.md).
````
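
The rule of thumb in this section (mostly pre-trained parameters → tiny learning rate and few epochs; mostly random parameters → high learning rate and many epochs) can be sketched in plain Python. This is an illustrative, hypothetical helper: the 50% threshold and the concrete hyperparameter values are assumptions, not Flair defaults.

```python
def choose_training_regime(num_pretrained: int, num_random: int) -> dict:
    """Illustrative heuristic: pick a regime from the share of pre-trained weights."""
    total = num_pretrained + num_random
    if total == 0:
        raise ValueError("model has no parameters")
    if num_pretrained / total >= 0.5:
        # Most weights come from a pre-trained LM: fine-tune gently.
        return {"method": "fine_tune", "learning_rate": 5e-5, "max_epochs": 4}
    # Most weights are randomly initialized (e.g. frozen LM + fresh head):
    # classic training with a high LR and many epochs.
    return {"method": "train", "learning_rate": 0.1, "max_epochs": 150}

# A transformer with ~110M pre-trained weights plus a small head -> fine-tune.
print(choose_training_regime(110_000_000, 20_000)["method"])  # fine_tune
# A frozen LM (no trainable pre-trained weights) + trainable head -> train.
print(choose_training_regime(0, 20_000)["method"])  # train
```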

flair/data.py

Lines changed: 73 additions & 8 deletions
```diff
@@ -1372,6 +1372,14 @@ def unlabeled_identifier(self) -> str:
 
 
 class Corpus(typing.Generic[T_co]):
+    """The main object in Flair for holding a dataset used for training and testing.
+
+    A corpus consists of three splits: A `train` split used for training, a `dev` split used for model selection
+    and/or early stopping and a `test` split used for testing. All three splits are optional, so it is possible
+    to create a corpus only using one or two splits. If the option `sample_missing_splits` is set to True,
+    missing splits will be randomly sampled from the training split.
+    """
+
     def __init__(
         self,
         train: Optional[Dataset[T_co]] = None,
```
```diff
@@ -1381,6 +1389,26 @@ def __init__(
         sample_missing_splits: Union[bool, str] = True,
         random_seed: Optional[int] = None,
     ) -> None:
+        """
+        Constructor method to initialize a :class:`Corpus`. You can define the train, dev and test split
+        by passing the corresponding Dataset object to the constructor. At least one split should be defined.
+        If the option `sample_missing_splits` is set to True, missing splits will be randomly sampled from the
+        train split.
+
+        In most cases, you will not use the constructor yourself. Rather, you will create a corpus using one of our
+        helper methods that read common NLP filetypes. For instance, you can use
+        :class:`flair.datasets.sequence_labeling.ColumnCorpus` to read CoNLL-formatted files directly into
+        a :class:`Corpus`.
+
+        Args:
+            train: The split you use for model training.
+            dev: A holdout split typically used for model selection or early stopping.
+            test: The final test data to compute the score of the model.
+            name: A name that identifies the corpus.
+            sample_missing_splits: If set to True, missing splits are sampled from train. If set to False,
+                missing splits are not sampled and left empty. Default: True.
+            random_seed: Set a random seed to control the sampling of missing splits.
+        """
         # set name
         self.name: str = name
 
```
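The `sample_missing_splits` behaviour documented in the docstring above can be sketched with the standard library alone: carve the missing dev and test splits out of train, reproducibly via a seed. This is a simplified stand-in, not Flair's implementation, and the 10% fraction is an assumption for illustration.

```python
import random

def sample_missing_splits(train, dev=None, test=None, fraction=0.1, random_seed=42):
    """Sketch: sample missing dev/test splits from the train split, reproducibly."""
    rng = random.Random(random_seed)
    train = list(train)  # copy so the caller's data is untouched
    if test is None:
        rng.shuffle(train)
        cut = int(len(train) * fraction)
        test, train = train[:cut], train[cut:]
    if dev is None:
        rng.shuffle(train)
        cut = int(len(train) * fraction)
        dev, train = train[:cut], train[cut:]
    return train, dev, test

sentences = [f"sentence {i}" for i in range(100)]
tr, dv, te = sample_missing_splits(sentences)
print(len(tr), len(dv), len(te))  # 81 9 10
```

Fixing the seed makes the sampling deterministic, which mirrors the role of the `random_seed` constructor argument.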
```diff
@@ -1419,14 +1447,17 @@ def __init__(
 
     @property
     def train(self) -> Optional[Dataset[T_co]]:
+        """The training split as a :class:`torch.utils.data.Dataset` object."""
         return self._train
 
     @property
     def dev(self) -> Optional[Dataset[T_co]]:
+        """The dev split as a :class:`torch.utils.data.Dataset` object."""
         return self._dev
 
     @property
     def test(self) -> Optional[Dataset[T_co]]:
+        """The test split as a :class:`torch.utils.data.Dataset` object."""
         return self._test
 
     def downsample(
```
```diff
@@ -1443,12 +1474,12 @@ def downsample(
         data points. It additionally returns a pointer to itself for use in method chaining.
 
         Args:
-            percentage (float): A float value between 0. and 1. that indicates to which percentage the corpus
+            percentage: A float value between 0. and 1. that indicates to which percentage the corpus
                 should be downsampled. Default value is 0.1, meaning it gets downsampled to 10%.
-            downsample_train (bool): Whether or not to include the training split in downsampling. Default is True.
-            downsample_dev (bool): Whether or not to include the dev split in downsampling. Default is True.
-            downsample_test (bool): Whether or not to include the test split in downsampling. Default is True.
-            random_seed (int): An optional random seed to make downsampling reproducible.
+            downsample_train: Whether or not to include the training split in downsampling. Default is True.
+            downsample_dev: Whether or not to include the dev split in downsampling. Default is True.
+            downsample_test: Whether or not to include the test split in downsampling. Default is True.
+            random_seed: An optional random seed to make downsampling reproducible.
 
         Returns:
             A pointer to itself for optional use in method chaining.
```
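The downsampling semantics documented above (keep a given percentage of a split, reproducibly via an optional seed) can be sketched as follows. `downsample` here is a hypothetical stdlib stand-in for the method, not Flair's actual code, which delegates to `_downsample_to_proportion`.

```python
import random

def downsample(split, percentage=0.1, random_seed=None):
    """Sketch: keep roughly `percentage` of the data points in a split."""
    if not 0.0 < percentage <= 1.0:
        raise ValueError("percentage must be in (0, 1]")
    rng = random.Random(random_seed)
    k = int(len(split) * percentage)
    # random.Random(seed).sample makes the selection reproducible per seed.
    return rng.sample(list(split), k)

corpus_train = [f"sent-{i}" for i in range(500)]
smaller = downsample(corpus_train, percentage=0.1, random_seed=1)
print(len(smaller))  # 50
```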
```diff
@@ -1580,9 +1611,17 @@ def _downsample_to_proportion(dataset: Dataset, proportion: float, random_seed:
         return splits[0]
 
     def obtain_statistics(self, label_type: Optional[str] = None, pretty_print: bool = True) -> Union[dict, str]:
-        """Print statistics about the class distribution and sentence sizes.
+        """Print statistics about the corpus, including the length of the sentences and the labels in the corpus.
 
-        only labels of sentences are taken into account
+        Args:
+            label_type: Optionally set this value to obtain statistics only for one specific type of label (such
+                as "ner" or "pos"). If not set, statistics for all labels will be returned.
+            pretty_print: If set to True, returns pretty json (indented for readability). If not, the json is
+                returned as a single line. Default: True.
+
+        Returns:
+            If pretty_print is True, returns a pretty print formatted string in json format. Otherwise, returns a
+            dictionary holding a json.
         """
         json_data = {
             "TRAIN": self._obtain_statistics_for(self.train, "TRAIN", label_type),
```
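The `pretty_print` switch documented above boils down to how the gathered statistics are returned: as indented JSON text or as the raw dictionary. A minimal sketch, with a placeholder statistics dict standing in for the real per-split numbers:

```python
import json

def obtain_statistics(stats: dict, pretty_print: bool = True):
    """Sketch: return indented JSON text, or the raw dict, per pretty_print."""
    if pretty_print:
        return json.dumps(stats, indent=4)  # multi-line, human-readable
    return stats  # the dictionary itself

stats = {"TRAIN": {"dataset": "TRAIN", "total_number_of_documents": 81}}
print(obtain_statistics(stats))
```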
```diff
@@ -1654,7 +1693,21 @@ def make_label_dictionary(
     ) -> Dictionary:
         """Creates a dictionary of all labels assigned to the sentences in the corpus.
 
-        :return: dictionary of labels
+        Args:
+            label_type: The name of the label type for which the dictionary should be created. Some corpora have
+                multiple layers of annotation, such as "pos" and "ner". In this case, you should choose the label type
+                you are interested in.
+            min_count: Optionally set this to exclude rare labels from the dictionary (i.e., labels seen fewer
+                than the provided integer value).
+            add_unk: Optionally set this to True to include a "UNK" value in the dictionary. In most cases, this
+                is not needed since the label dictionary is well-defined, but some use cases might have open classes
+                and require this.
+            add_dev_test: Optionally set this to True to construct the label dictionary not only from the train
+                split, but also from dev and test. This is only necessary if some labels never appear in train but do
+                appear in one of the other splits.
+
+        Returns:
+            A Dictionary of all unique labels in the corpus.
         """
         if min_count > 0 and not add_unk:
             add_unk = True
```
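The interplay of `min_count` and `add_unk` documented above can be illustrated with a `Counter`-based sketch over plain label strings. Flair's real method walks Sentence objects; the guard at the top mirrors the `if min_count > 0 and not add_unk` check in the diff, and the toy labels are made up.

```python
from collections import Counter

def make_label_dictionary(labels, min_count=-1, add_unk=False):
    """Sketch: keep labels seen at least min_count times; rare ones need UNK."""
    if min_count > 0 and not add_unk:
        # Filtering rare labels away forces an UNK entry so that
        # excluded labels can still be mapped to something.
        add_unk = True
    counts = Counter(labels)
    items = ["<unk>"] if add_unk else []
    for label, count in counts.most_common():
        if min_count <= 0 or count >= min_count:
            items.append(label)
    return items

labels = ["ORG", "PER", "PER", "LOC", "LOC", "LOC", "MISC"]
print(make_label_dictionary(labels, min_count=2))
# ['<unk>', 'LOC', 'PER']
```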
```diff
@@ -1833,13 +1886,25 @@ def add_label_noise(
         )
 
     def get_label_distribution(self):
+        """Counts occurrences of each label in the corpus and returns them as a dictionary object.
+
+        This allows you to get an idea of which label appears how often in the Corpus.
+
+        Returns:
+            Dictionary with labels as keys and their occurrences as values.
+        """
         class_to_count = defaultdict(lambda: 0)
         for sent in self.train:
             for label in sent.labels:
                 class_to_count[label.value] += 1
         return class_to_count
 
     def get_all_sentences(self) -> ConcatDataset:
+        """Returns all sentences (spanning all three splits) in the :class:`Corpus`.
+
+        Returns:
+            A :class:`torch.utils.data.Dataset` object that includes all sentences of this corpus.
+        """
         parts = []
         if self.train:
             parts.append(self.train)
```
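The counting loop in `get_label_distribution` shown above can be run self-contained over toy data. The `Label` tuple and the nested lists standing in for sentences are simplified stand-ins for Flair's objects, not its real types.

```python
from collections import defaultdict
from typing import NamedTuple

class Label(NamedTuple):
    value: str

# Toy stand-in for corpus.train: each "sentence" is just a list of labels.
toy_train = [
    [Label("PER"), Label("LOC")],
    [Label("LOC")],
    [Label("ORG"), Label("LOC")],
]

def get_label_distribution(sentences):
    """Count how often each label value occurs across the training split."""
    class_to_count = defaultdict(lambda: 0)
    for sent in sentences:
        for label in sent:
            class_to_count[label.value] += 1
    return class_to_count

print(dict(get_label_distribution(toy_train)))  # {'PER': 1, 'LOC': 3, 'ORG': 1}
```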

flair/datasets/__init__.py

Lines changed: 2 additions & 0 deletions

```diff
@@ -217,6 +217,7 @@
     NER_MULTI_WIKINER,
     NER_MULTI_XTREME,
     NER_NERMUD,
+    NER_NOISEBENCH,
     NER_SWEDISH,
     NER_TURKU,
     NER_UKRAINIAN,
@@ -496,6 +497,7 @@
     "NER_GERMAN_MOBIE",
     "NER_GERMAN_POLITICS",
     "NER_HIPE_2022",
+    "NER_NOISEBENCH",
     "NER_HUNGARIAN",
     "NER_ICDAR_EUROPEANA",
     "NER_ICELANDIC",
```

0 commit comments