Commit 4dcd54e

Feat/bert early stopping (#223)

* change how `clear_cache` is called
* first version of early stopping
* change mypy version
* `train_test_split` bug fix
* add `compute_metrics` and `EarlyStoppingCallback`
* bug fix
* fix mypy
* try to fix `"eval_f1" not found` error
* forgot to update `from_context`
* try to fix mypy
* try to fix "not found f1" error
* refactor a little bit
* disable early stopping for LoRA
* fix typing errors
* update CONTRIBUTING and Makefile
* minor change
* use our metrics
* add docstrings
* set 3.10 for mypy
* update CONTRIBUTING.md
* try to fix bug
* try to fix typing issue
* try to fix
* add early stopping to p-tuning
1 parent 2d7c380 commit 4dcd54e

22 files changed: +237 −102 lines


CONTRIBUTING.md

Lines changed: 34 additions & 24 deletions

@@ -1,63 +1,73 @@
 # Contribute to AutoIntent

-## Минимальная конфигурация
+## Minimum Configuration

-Мы используем `poetry` в качестве менеджера зависимостей и упаковщика.
+We use `poetry` as our dependency manager and packager.

-1. Установить `poetry`. Советуем обратиться к разделу официальной документации [Installation with the official installer](https://python-poetry.org/docs/#installing-with-the-official-installer). Если кратко, то достаточно просто запустить команду:
+1. Install `poetry`. We recommend referring to the official documentation section [Installation with the official installer](https://python-poetry.org/docs/#installing-with-the-official-installer). In short, you just need to run:
 ```bash
 curl -sSL https://install.python-poetry.org | python3 -
 ```

-2. Склонировать проект, перейти в корень
+2. Clone the project and navigate to the root directory

-3. Установить проект со всеми зависимостями:
+3. Install the project with all dependencies:
 ```bash
 make install
 ```

-## Дополнительно
+## Additional Setup

-Чтобы удобнее трекать ошибки в кодстайле, советуем установить расширение ruff для IDE. Например, для VSCode:
+To make it easier to track code style errors, we recommend installing the ruff extension for your IDE. For example, for VSCode:
 ```
 https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff
 ```
-С этим расширением ошибки в кодстайле будут подчеркиваться прямо в редакторе.
+With this extension, code style errors will be underlined directly in the editor.

-В корень проекта добавлен файл `.vscode/settings.json`, который указывает расширению путь к конфигу линтера.
+A `.vscode/settings.json` file has been added to the project root, which points the extension to the linter configuration.

 ## Contribute

-1. Создать ветку, в которой вы будете работать. Чтобы остальным было проще понимать характер вашего контрибьюта, нужно давать краткие, но понятные названия. Советуем начинать названия на `feat/` для веток с новыми фичами, `fix/` для исправления багов, `refactor/` для рефакторинга, `test/` для добавления тестов.
+1. Create a branch for your work. To make it easier for others to understand the nature of your contribution, use brief but clear names. We recommend starting branch names with `feat/` for new features, `fix/` for bug fixes, `refactor/` for refactoring, and `test/` for adding tests.

-2. Коммит, коммит, коммит, коммит
+2. Commit, commit, commit, commit

-3. Если есть новые фичи, желательно добавить для них тесты в директорию [tests](./tests).
+3. If there are new features, it's advisable to add tests for them in the [tests](./tests) directory.

-4. Проверить, что внесенные изменения не ломают имеющиеся фичи
+4. You can open a PR!
+
+Every commit in any PR triggers GitHub Actions with automated tests. All checks block merging into the main branch (with rare exceptions).
+
+Sometimes waiting for CI can be long, and it's often more convenient to run checks locally:
+- Check that your changes don't break existing features:
 ```bash
 make test
 ```
-
-5. Проверить кодстайл
+Or run a specific test (using `test_bert.py` as an example):
+```bash
+poetry run pytest tests/modules/scoring/test_bert.py
+```
+- Check code style (this also applies the formatter):
 ```bash
 make lint
 ```
+- Check type hints:
+```bash
+make typing
+```
+Note: if mypy shows different errors locally compared to GitHub Actions, update your local dependencies:
+```bash
+make update
+```

-6. Ура, можно открывать Pull Request!
-
-## Устройство проекта
-
-![](assets/dependency-graph.png)
-
-## Построение документации
+## Building Documentation

-Построить html версию в папке `docs/build`:
+Build the HTML version in the `docs/build` folder:
 ```bash
 make docs
 ```

-Построить html версию и захостить локально:
+Build the HTML version and host it locally:
 ```bash
 make serve-docs
 ```

Makefile

Lines changed: 4 additions & 3 deletions

@@ -22,9 +22,10 @@ lint:
 	$(poetry) ruff format
 	$(poetry) ruff check --fix

-.PHONY: sync
-sync:
-	poetry sync --extras "dev test typing docs"
+.PHONY: update
+update:
+	rm -f poetry.lock
+	poetry install --extras "dev test typing docs"

 .PHONY: docs
 docs:

assets/classification_pipeline.png

-134 KB
Binary file not shown.

assets/dependency-graph.png

-73.4 KB
Binary file not shown.

autointent/configs/__init__.py

Lines changed: 9 additions & 1 deletion

@@ -2,11 +2,19 @@

 from ._inference_node import InferenceNodeConfig
 from ._optimization import DataConfig, LoggingConfig
-from ._transformers import CrossEncoderConfig, EmbedderConfig, HFModelConfig, TaskTypeEnum, TokenizerConfig
+from ._transformers import (
+    CrossEncoderConfig,
+    EarlyStoppingConfig,
+    EmbedderConfig,
+    HFModelConfig,
+    TaskTypeEnum,
+    TokenizerConfig,
+)

 __all__ = [
     "CrossEncoderConfig",
     "DataConfig",
+    "EarlyStoppingConfig",
     "EmbedderConfig",
     "HFModelConfig",
     "InferenceNodeConfig",

autointent/configs/_transformers.py

Lines changed: 21 additions & 0 deletions

@@ -4,6 +4,9 @@
 from pydantic import BaseModel, ConfigDict, Field, PositiveInt
 from typing_extensions import Self, assert_never

+from autointent.custom_types import FloatFromZeroToOne
+from autointent.metrics import SCORING_METRICS_MULTICLASS, SCORING_METRICS_MULTILABEL
+

 class TokenizerConfig(BaseModel):
     padding: bool | Literal["longest", "max_length", "do_not_pad"] = True

@@ -122,3 +125,21 @@ class CrossEncoderConfig(HFModelConfig):
     tokenizer_config: TokenizerConfig = Field(
         default_factory=lambda: TokenizerConfig(max_length=512)
     )  # this is because sentence-transformers doesn't allow you to customize tokenizer settings properly
+
+
+class EarlyStoppingConfig(BaseModel):
+    val_fraction: float = Field(
+        0.2,
+        description=(
+            "Fraction of train samples to allocate to a dev set to monitor quality "
+            "during training and perform early stopping if quality doesn't improve."
+        ),
+    )
+    patience: PositiveInt = Field(1, description="Maximum number of epochs to wait for quality to improve.")
+    threshold: FloatFromZeroToOne = Field(
+        0.0,
+        description="Minimum quality increment to count as an improvement. Default: any increment is counted.",
+    )
+    metric: Literal[tuple((SCORING_METRICS_MULTILABEL | SCORING_METRICS_MULTICLASS).keys())] | None = Field(  # type: ignore[valid-type]
+        "scoring_f1", description="Metric to monitor."
+    )
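Since `EarlyStoppingConfig` is re-exported from `autointent.configs` (see the `__init__.py` diff above), it can be constructed directly. Below is a minimal sketch of instantiating it with non-default values; the field names and defaults come from the diff, while the chosen values are illustrative:

```python
from autointent.configs import EarlyStoppingConfig

# Hold out 10% of the training split for validation and stop training after
# 2 epochs in which scoring_f1 fails to improve by at least 0.01.
early_stopping = EarlyStoppingConfig(
    val_fraction=0.1,
    patience=2,
    threshold=0.01,
    metric="scoring_f1",  # also the default monitored metric per the diff
)
```

Note the `metric` annotation: `Literal[tuple(...)]` builds the set of allowed metric names at runtime from the two scoring-metric registries, which mypy cannot verify statically; hence the `# type: ignore[valid-type]`.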

autointent/metrics/retrieval.py

Lines changed: 1 addition & 1 deletion

@@ -539,7 +539,7 @@ def retrieval_ndcg(query_labels: LABELS_VALUE_TYPE, candidates_labels: CANDIDATE
     query_label_, candidates_labels_ = transform(query_labels, candidates_labels)

     ndcg_scores: list[float] = []
-    relevance_scores: npt.NDArray[np.bool] = query_label_[:, None] == candidates_labels_
+    relevance_scores = query_label_[:, None] == candidates_labels_

     for rel_scores in relevance_scores:
         cur_dcg = _dcg(rel_scores, k)
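The dropped annotation appears redundant rather than load-bearing: an equality comparison between broadcast NumPy arrays already produces a boolean array, and `np.bool` is not a stable alias across NumPy versions, which plausibly explains the mypy friction mentioned in the commit message. A small self-contained sketch of the inference:

```python
import numpy as np

query_labels = np.array([0, 1, 2])
candidates_labels = np.array([[0, 1], [1, 1], [2, 0]])

# Broadcasting (3, 1) against (3, 2) yields a (3, 2) boolean relevance matrix;
# the dtype is inferred, so no explicit npt.NDArray annotation is needed.
relevance_scores = query_labels[:, None] == candidates_labels
print(relevance_scores.dtype)  # bool
```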

autointent/modules/base/_base.py

Lines changed: 1 addition & 0 deletions

@@ -187,6 +187,7 @@ def score_metrics_cv(  # type: ignore[no-untyped-def]
         all_val_preds = []

         for train_utterances, train_labels, val_utterances, val_labels in cv_iterator:
+            self.clear_cache()
             self.fit(train_utterances, train_labels, **fit_kwargs)  # type: ignore[arg-type]
             val_preds = self.predict(val_utterances)
             for name, fn in metrics_dict.items():

autointent/modules/embedding/_logreg.py

Lines changed: 2 additions & 4 deletions

@@ -81,7 +81,8 @@ def from_context(

     def clear_cache(self) -> None:
         """Clear embedder from memory."""
-        self._embedder.clear_ram()
+        if hasattr(self, "_embedder"):
+            self._embedder.clear_ram()

     def fit(self, utterances: list[str], labels: ListOfLabels) -> None:
         """Train the logistic regression model using the provided utterances and labels.

@@ -90,9 +91,6 @@ def fit(self, utterances: list[str], labels: ListOfLabels) -> None:
             utterances: List of text data to index
             labels: List of corresponding labels for the utterances
         """
-        if hasattr(self, "_embedder"):
-            self.clear_cache()
-
         self._validate_task(labels)

         self._embedder = Embedder(

autointent/modules/embedding/_retrieval.py

Lines changed: 2 additions & 4 deletions

@@ -83,9 +83,6 @@ def fit(self, utterances: list[str], labels: ListOfLabels) -> None:
             utterances: List of text data to index
             labels: List of corresponding labels for the utterances
         """
-        if hasattr(self, "_vector_index"):
-            self.clear_cache()
-
         self._validate_task(labels)

         self._vector_index = VectorIndex(

@@ -140,7 +137,8 @@ def get_assets(self) -> EmbeddingArtifact:

     def clear_cache(self) -> None:
         """Clear cached data in memory used by the vector index."""
-        self._vector_index.clear_ram()
+        if hasattr(self, "_vector_index"):
+            self._vector_index.clear_ram()

     def predict(self, utterances: list[str]) -> list[ListOfLabels]:
         """Predict the nearest neighbors for a list of utterances.
