Commit 426af0d

Feat/basic presets (#146)
* move `EmbedderConfig` and `CrossEncoderConfig` to `autointent.configs`
* remove `_datafiles`
* implement search space filtering
* add `warning` mode
* change default model for cross-encoder
* refactor constructor of `SklearnScorer`
* add the heaviest search space preset
* add search space validation for `DescriptionScorer`
* add default value for `ThresholdDecision`
* implement default configs for embedder and cross encoder
* add transformers configs to pipeline
* implement two basic presets
* fix codestyle
* fix typing
* Update optimizer_config.schema.json
* bug fix and update test
* search space validation bug found
* update unit tests
* update test
* improve sklearn test
* refactor sklearn scorer
* remove `VectorIndexConfig` entirely from our lib
* try to fix validation errors
* remove multiclass/multilabel separation on modules dicts
* upd test
* fix codestyle
* Update optimizer_config.schema.json
* something's wrong with sklearn scorer again
* remove unnecessary default value from sklearn scorer constructor
* add default value for `weights` in knn and rerank scorer
* try without search space validation
* Update optimizer_config.schema.json
* foolish bug fix
* Update optimizer_config.schema.json
* finish implementing presets
* Update optimizer_config.schema.json
* update docs
* update docs and readme
* upd ci
* respond to samoed
* update tests
* fix codestyle
* Update optimizer_config.schema.json
* remove sklearn scorer from config for now
* bug fix

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
1 parent 70ed53a commit 426af0d

68 files changed, 932 additions and 727 deletions
Lines changed: 38 additions & 0 deletions
@@ -0,0 +1,38 @@
+name: test presets
+
+on:
+  push:
+    branches:
+      - dev
+  pull_request:
+
+jobs:
+  test:
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false
+      matrix:
+        os: [ ubuntu-latest ]
+        python-version: [ "3.10", "3.11", "3.12" ]
+        include:
+          - os: windows-latest
+            python-version: "3.10"
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Setup Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v5
+        with:
+          python-version: ${{ matrix.python-version }}
+          cache: "pip"
+
+      - name: Install dependencies
+        run: |
+          pip install .
+          pip install pytest pytest-asyncio
+
+      - name: Run tests
+        run: |
+          pytest tests/pipeline/test_presets.py
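
The job matrix in this workflow can be read as a cross product plus one extra entry. The following is a minimal sketch of that expansion, assuming the simplified rule that an `include` entry matching no existing combination becomes its own job. The `expand` helper is hypothetical and is not part of GitHub Actions or of this repository.

```python
from itertools import product

# Values copied from the workflow's strategy.matrix section.
matrix = {
    "os": ["ubuntu-latest"],
    "python-version": ["3.10", "3.11", "3.12"],
}
include = [{"os": "windows-latest", "python-version": "3.10"}]


def expand(matrix, include):
    """Hypothetical helper: a simplified model of matrix expansion."""
    keys = list(matrix)
    # Cross product of every matrix axis.
    jobs = [dict(zip(keys, combo)) for combo in product(*(matrix[k] for k in keys))]
    for extra in include:
        # Simplification: an entry that matches no existing combination
        # is appended as a separate job.
        if extra not in jobs:
            jobs.append(dict(extra))
    return jobs


jobs = expand(matrix, include)
for job in jobs:
    print(job)
```

Under this model the workflow runs four jobs: three on `ubuntu-latest` (one per Python version) and one on `windows-latest` with Python 3.10.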

README.md

Lines changed: 2 additions & 2 deletions
@@ -30,7 +30,7 @@ Example of building an intent classifier in a couple of lines of code:
 from autointent import Pipeline, Dataset

 dataset = Dataset.from_json(path_to_json)
-pipeline = Pipeline.default_optimizer(multilabel=False)
+pipeline = Pipeline.from_preset("light")
 pipeline.fit(dataset)
-pipeline.predict(["show me my latest recent transactions"])
+pipeline.predict(["show me my latest transactions"])
 ```

autointent/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -7,6 +7,7 @@
 from ._dataset import Dataset
 from ._hash import Hasher
 from .context import Context, load_dataset
+from ._optimization_config import OptimizationConfig
 from ._pipeline import Pipeline


@@ -15,6 +16,7 @@
     "Dataset",
     "Embedder",
     "Hasher",
+    "OptimizationConfig",
     "Pipeline",
     "Ranker",
     "VectorIndex",

autointent/_datafiles/default-multiclass-config.yaml

Lines changed: 0 additions & 26 deletions
This file was deleted.

autointent/_datafiles/default-multilabel-config.yaml

Lines changed: 0 additions & 21 deletions
This file was deleted.

autointent/_datafiles/inference-config-example.yaml

Lines changed: 0 additions & 17 deletions
This file was deleted.

autointent/_dataset/_dataset.py

Lines changed: 21 additions & 0 deletions
@@ -1,6 +1,7 @@
 """File with Dataset definition."""

 import json
+import logging
 from collections import defaultdict
 from functools import cached_property
 from pathlib import Path
@@ -12,6 +13,8 @@
 from autointent.custom_types import LabelWithOOS, Split
 from autointent.schemas import Intent, Tag

+logger = logging.getLogger(__name__)
+

 class Sample(TypedDict):
     """
@@ -36,6 +39,7 @@ class Dataset(dict[str, HFDataset]):

     label_feature = "label"
     utterance_feature = "utterance"
+    has_descriptions: bool

     def __init__(self, *args: Any, intents: list[Intent], **kwargs: Any) -> None:  # noqa: ANN401
         """
@@ -49,6 +53,8 @@ def __init__(self, *args: Any, intents: list[Intent], **kwargs: Any) -> None: #

         self.intents = intents

+        self.has_descriptions = self.validate_descriptions()
+
     @property
     def multilabel(self) -> bool:
         """
@@ -197,3 +203,18 @@ def _to_multilabel(self, sample: Sample) -> Sample:
         ohe_vector[sample["label"]] = 1
         sample["label"] = ohe_vector
         return sample
+
+    def validate_descriptions(self) -> bool:
+        """
+        Check whether the dataset contains text descriptions for each intent.
+
+        :return: True if all intents have description field
+        """
+        has_any = any(intent.description is not None for intent in self.intents)
+        has_all = all(intent.description is not None for intent in self.intents)
+
+        if has_any and not has_all:
+            msg = "Some intents have text descriptions, but some of them not."
+            logger.warning(msg)
+
+        return has_all
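
The behaviour of the new `validate_descriptions` check can be tried in isolation. The sketch below reproduces the same `any`/`all` logic as a free function. The `Intent` dataclass here is a hypothetical stand-in with only the fields the check needs, not the real `autointent.schemas.Intent`, and the warning text is paraphrased.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class Intent:
    """Hypothetical stand-in for autointent.schemas.Intent."""

    id: int
    description: Optional[str] = None


def validate_descriptions(intents: list) -> bool:
    """Return True only if every intent has a description."""
    has_any = any(i.description is not None for i in intents)
    has_all = all(i.description is not None for i in intents)
    if has_any and not has_all:
        # Partial coverage is reported but does not raise.
        logger.warning("Only some intents have text descriptions.")
    return has_all


full = [Intent(0, "greeting"), Intent(1, "farewell")]
partial = [Intent(0, "greeting"), Intent(1)]
none = [Intent(0), Intent(1)]

print(validate_descriptions(full))     # True
print(validate_descriptions(partial))  # False, and a warning is logged
print(validate_descriptions(none))     # False, no warning
```

One property of this logic worth knowing: `all()` over an empty iterable is `True`, so an empty intent list is reported as having descriptions.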

autointent/_dump_tools.py

Lines changed: 2 additions & 1 deletion
@@ -12,7 +12,8 @@
 from sklearn.base import BaseEstimator

 from autointent import Embedder, Ranker, VectorIndex
-from autointent.schemas import CrossEncoderConfig, EmbedderConfig, TagsList
+from autointent.configs import CrossEncoderConfig, EmbedderConfig
+from autointent.schemas import TagsList

 ModuleSimpleAttributes = None | str | int | float | bool | list  # type: ignore[type-arg]


autointent/_embedder.py

Lines changed: 1 addition & 1 deletion
@@ -17,7 +17,7 @@
 from sentence_transformers import SentenceTransformer

 from ._hash import Hasher
-from .schemas import EmbedderConfig, TaskTypeEnum
+from .configs import EmbedderConfig, TaskTypeEnum


 def get_embeddings_path(filename: str) -> Path:

autointent/_optimization_config.py

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+from pydantic import BaseModel, PositiveInt
+
+from .configs import CrossEncoderConfig, DataConfig, EmbedderConfig, LoggingConfig
+from .custom_types import SamplerType
+from .nodes.schemes import OptimizationSearchSpaceConfig
+
+
+class OptimizationConfig(BaseModel):
+    """Configuration for the optimization process."""
+
+    data_config: DataConfig = DataConfig()
+    search_space: OptimizationSearchSpaceConfig
+    logging_config: LoggingConfig = LoggingConfig()
+    embedder_config: EmbedderConfig = EmbedderConfig()
+    cross_encoder_config: CrossEncoderConfig = CrossEncoderConfig()
+    sampler: SamplerType = "brute"
+    seed: PositiveInt = 42
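
The shape of `OptimizationConfig` is one required field (`search_space`) with every other field defaulted. The sketch below illustrates that shape using only the standard library. It swaps pydantic for `dataclasses`, so the classes, the `model_name` field, and its placeholder value are all assumptions for illustration, not the library's real definitions.

```python
from dataclasses import dataclass, field


@dataclass
class EmbedderConfig:
    """Hypothetical stand-in; field name and value are placeholders."""

    model_name: str = "placeholder-embedder"


@dataclass
class OptimizationConfig:
    """Stdlib approximation of the pydantic model in the diff."""

    search_space: list  # required: no default, as in the original
    embedder_config: EmbedderConfig = field(default_factory=EmbedderConfig)
    sampler: str = "brute"
    seed: int = 42

    def __post_init__(self) -> None:
        # Rough equivalent of pydantic's PositiveInt constraint.
        if self.seed <= 0:
            raise ValueError("seed must be a positive integer")


# Only the search space has to be supplied; the rest falls back to defaults.
cfg = OptimizationConfig(search_space=[])
print(cfg.sampler, cfg.seed, cfg.embedder_config.model_name)
```

Omitting `search_space` raises a `TypeError` in this sketch; with pydantic the equivalent failure would surface as a validation error instead.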
