model2vec/distill/distillation.py
Lines changed: 89 additions & 27 deletions
@@ -40,7 +40,8 @@ def distill_from_model(
     vocabulary: list[str] | None = None,
     device: str | None = None,
     pca_dims: PCADimType = 256,
-    apply_zipf: bool = True,
+    apply_zipf: bool | None = None,
+    sif_coefficient: float | None = 1e-4,
     use_subword: bool = True,
     token_remove_pattern: str | None = r"\[unused\d+\]",
 ) -> StaticModel:
@@ -60,30 +61,19 @@ def distill_from_model(
     :param pca_dims: The number of components to use for PCA.
         If this is None, we don't apply PCA.
         If this is 'auto', we don't reduce dimensionality, but still apply PCA.
-    :param apply_zipf: Whether to apply Zipf weighting to the embeddings.
+    :param apply_zipf: DEPRECATED: This parameter used to control whether Zipf is applied.
+        Zipf weighting is now controlled by the sif_coefficient parameter. If this is set to None, no weighting is applied.
+    :param sif_coefficient: The SIF coefficient to use. If this is None, no weighting is applied.
+        Should be a value > 0 and < 1.0. A value of 1e-4 is a good default.
     :param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words.
     :param token_remove_pattern: If this is set to a string, we compile this into a regex. Any tokens that conform to this regex pattern will be removed from the vocabulary.
         If the pattern is so general that it removes all tokens, we throw an error. If the pattern can't be compiled into a valid regex, we also throw an error.
-    :raises: ValueError if the PCA dimension is larger than the number of dimensions in the embeddings.
-    :raises: ValueError if the vocabulary contains duplicate tokens.
-    :raises: ValueError if the regex can't be compiled.
-    :raises: ValueError if the vocabulary is empty after token removal.
     :return: A StaticModel

     """
-    device = select_optimal_device(device)
-    if not use_subword and vocabulary is None:
-        raise ValueError(
-            "You must pass a vocabulary if you don't use subword tokens. Either pass a vocabulary, or set use_subword to True."
+    Validate the parameters passed to the distillation function.
+
+    :param tokenizer: The tokenizer to use.
+    :param vocabulary: The vocabulary to use.
+    :param apply_zipf: DEPRECATED: This parameter used to control whether Zipf is applied.
+        Zipf weighting is now controlled by the sif_coefficient parameter. If this is set to None, no weighting is applied.
+    :param sif_coefficient: The SIF coefficient to use. If this is None, no weighting is applied.
+        Should be a value > 0 and < 1.0. A value of 1e-4 is a good default.
+    :param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words.
+    :return: The SIF coefficient to use.
+    :raises: ValueError if the PCA dimension is larger than the number of dimensions in the embeddings.
+    :raises: ValueError if the vocabulary contains duplicate tokens.
+    :raises: ValueError if the regex can't be compiled.
+    :raises: ValueError if the vocabulary is empty after token removal.
+
+    """
+    if apply_zipf is not None:
+        logger.warning(
+            "The `apply_zipf` parameter is deprecated and will be removed in the next release. "
+            "Zipf weighting is applied based on the sif_coefficient parameter. If this is set to None, "
+            "no weighting is applied."
+        )
+        if apply_zipf and sif_coefficient is None:
+            logger.warning("You set apply_zipf to True, but sif_coefficient is None. Setting sif_coefficient to 1e-4.")
+            sif_coefficient = 1e-4
+        elif not apply_zipf:
+            logger.warning("Because you set apply_zipf to False, we ignore the sif_coefficient parameter.")
+            sif_coefficient = None
+
+    if sif_coefficient is not None:
+        if not 0 < sif_coefficient < 1.0:
+            raise ValueError("SIF coefficient must be a value > 0 and < 1.0.")
+
+    if not use_subword and vocabulary is None:
+        raise ValueError(
+            "You must pass a vocabulary if you don't use subword tokens. Either pass a vocabulary, or set use_subword to True."
     :param pca_dims: The number of components to use for PCA.
         If this is None, we don't apply PCA.
         If this is 'auto', we don't reduce dimensionality, but still apply PCA.
-    :param apply_zipf: Whether to apply Zipf weighting to the embeddings.
+    :param apply_zipf: DEPRECATED: This parameter used to control whether Zipf is applied.
+        Zipf weighting is now controlled by the sif_coefficient parameter. If this is set to None, no weighting is applied.
+    :param sif_coefficient: The SIF coefficient to use. If this is None, no weighting is applied.
+        Should be a value > 0 and < 1.0. A value of 1e-4 is a good default.
     :param use_subword: Whether to keep subword tokens in the vocabulary. If this is False, you must pass a vocabulary, and the returned tokenizer will only detect full words.
     :param token_remove_pattern: If this is set to a string, we compile this into a regex. Any tokens that conform to this regex pattern will be removed from the vocabulary.
     :param trust_remote_code: Whether to trust the remote code. If this is False, we will only load components coming from `transformers`. If this is True, we will load all components.
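The `sif_coefficient` docstrings describe the parameter's range but not the weighting it drives. As a rough sketch: SIF weighting scales each token vector by `a / (a + p(w))`, where `a` is the coefficient and `p(w)` is an estimated token probability. The version below assumes the vocabulary is sorted from most to least frequent and estimates `p(w)` from Zipf's law (probability proportional to 1/rank). The function name `sif_weights` and that probability estimate are illustrative assumptions, not code taken from this diff.

```python
import numpy as np


def sif_weights(vocab_size: int, sif_coefficient: float = 1e-4) -> np.ndarray:
    """Return one SIF weight per token (hypothetical helper).

    Assumes tokens are ordered from most to least frequent.
    """
    if not 0 < sif_coefficient < 1.0:
        raise ValueError("SIF coefficient must be a value > 0 and < 1.0.")
    # Zipf assumption: probability is proportional to 1 / rank (ranks start at 1).
    inverse_rank = 1.0 / np.arange(1, vocab_size + 1)
    probabilities = inverse_rank / inverse_rank.sum()
    # SIF weight a / (a + p(w)): frequent tokens receive smaller weights.
    return sif_coefficient / (sif_coefficient + probabilities)


# Toy embedding matrix: 5 tokens, 3 dimensions.
embeddings = np.ones((5, 3))
weighted = embeddings * sif_weights(len(embeddings))[:, None]
```

Under these assumptions, a smaller coefficient down-weights frequent tokens more aggressively, which is why the accepted range excludes 0 (every weight would collapse to zero).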
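The interaction between the deprecated `apply_zipf` flag and `sif_coefficient` is easier to follow in isolation. The sketch below restates the validation branch from the diff as a standalone function. The name `resolve_sif_coefficient` is hypothetical, the two inner warnings are omitted and the deprecation message is shortened, and the vocabulary and tokenizer checks are left out.

```python
from __future__ import annotations

import logging

logger = logging.getLogger(__name__)


def resolve_sif_coefficient(apply_zipf: bool | None, sif_coefficient: float | None) -> float | None:
    """Reconcile the deprecated flag with the new coefficient (hypothetical helper)."""
    if apply_zipf is not None:
        logger.warning("The `apply_zipf` parameter is deprecated; use `sif_coefficient` instead.")
        if apply_zipf and sif_coefficient is None:
            # Old callers asking for Zipf weighting get the default coefficient.
            sif_coefficient = 1e-4
        elif not apply_zipf:
            # An explicit False disables weighting, overriding any coefficient.
            sif_coefficient = None

    if sif_coefficient is not None and not 0 < sif_coefficient < 1.0:
        raise ValueError("SIF coefficient must be a value > 0 and < 1.0.")
    return sif_coefficient
```

With this sketch, `resolve_sif_coefficient(None, 1e-4)` returns `1e-4`, `resolve_sif_coefficient(True, None)` also returns `1e-4`, `resolve_sif_coefficient(False, 1e-4)` returns `None`, and `resolve_sif_coefficient(None, 1.0)` raises `ValueError`.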