
Added CatBoostScorer #209

Merged

voorhs merged 55 commits into dev from feat/catboost_scorer on Jun 18, 2025

Conversation

@nikiduki (Collaborator) commented May 14, 2025

In catboost_scorer.py I added dump and load for CatBoostScorer. Later they can be revised and moved into _dump_tools.

verbose: bool = False,
**catboost_kwargs: Any, # noqa: ANN401
) -> None:
self.classification_model_config = EmbedderConfig.from_search_config(classification_model_config)
Member


If classification_model_config is None, an EmbedderConfig with default values will be created. As a result, your test_catboost_without_embedder test is not actually testing what it is supposed to.
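A small illustration of the behaviour being flagged, with a dataclass standing in for the real pydantic EmbedderConfig (names assumed from the snippet):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmbedderConfig:
    """Stand-in for the real pydantic model."""

    model_name: str = "default-embedder"

    @classmethod
    def from_search_config(cls, config: Optional["EmbedderConfig"]) -> "EmbedderConfig":
        # Mirrors the behaviour described in the review comment:
        # a None input silently becomes a default config.
        return cls() if config is None else config


cfg = EmbedderConfig.from_search_config(None)
# `cfg` is a default EmbedderConfig, not None, so a test that passes
# classification_model_config=None never exercises the no-embedder path.
```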


def encode(texts: list[str]) -> npt.NDArray[np.float32]:
with torch.no_grad():
batch = tokenizer(
Member


EmbedderConfig already has tokenization parameters in tokenizer_config.

Comment on lines 172 to 177

def _init_text_tools(self) -> None:
if not hasattr(self, "_tokenizer"):
self._tokenizer = Tokenizer(lowercasing=True, separator_type="BySense", token_types=["Word", "Number"])
if not hasattr(self, "_dictionary"):
self._dictionary = Dictionary(occurence_lower_bound=1, gram_order=1)
Member


This can be removed here as well.

Comment on lines 210 to 214
y_mat = np.zeros((len(labels), self._n_classes), dtype=np.float32)
for i, lbls in enumerate(cast("Sequence[Sequence[int]]", labels)):
for class_i, lbl in enumerate(lbls):
y_mat[i, class_i] = lbl
y = y_mat
Member


Can this simply be replaced with np.asarray(labels)?
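The nested loop above can indeed be collapsed into a single np.asarray call, provided every row of labels is a full binary indicator vector of length _n_classes (the data below is illustrative):

```python
import numpy as np

# Illustrative multilabel input: one binary indicator per class.
labels = [[0, 1, 1], [1, 0, 0]]
n_classes = 3

# Manual construction, as in the snippet above:
y_mat = np.zeros((len(labels), n_classes), dtype=np.float32)
for i, lbls in enumerate(labels):
    for class_i, lbl in enumerate(lbls):
        y_mat[i, class_i] = lbl

# Suggested replacement:
y = np.asarray(labels, dtype=np.float32)
```

The two are equivalent only when all rows have the same length; ragged rows would previously be zero-padded by the loop but would fail (or produce an object array) under np.asarray.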

self._dictionary_fitted = False

def get_embedder_config(self) -> dict[str, Any]:
return self.embedder_config.model_dump()
Member


If self._use_embedder is False, this will raise an error.

Comment on lines 123 to 128
if not hasattr(self, "_tokenizer"):
self._tokenizer = Tokenizer(lowercasing=True, separator_type="BySense", token_types=["Word", "Number"])
if not hasattr(self, "_dictionary"):
self._dictionary = Dictionary(occurence_lower_bound=1, gram_order=1)
if not hasattr(self, "_dictionary_fitted"):
self._dictionary_fitted = False
Member


This should somehow be made configurable — how about this? https://catboost.ai/docs/en/references/text-processing__test-processing__default-value
It seems the tokenizer, dictionary, etc. can simply be left unspecified.

voorhs and others added 9 commits June 14, 2025 21:40
* fix

* sklearn scorer proper name

* fix typing errors

* try to fix pydantic errors
* Update wandb.py

* Update wandb.py

* Update wandb.py

* Update _optimization_info.py

* remove print
* fix few shot split

* lint
* change how `clear_cache` is called

* first version of early stopping

* change mypy version

* train_test_split bug fix

* add `compute_metrics` and `EarlyStoppingCallback`

* bug fix

* fix mypy

* try to fix `"eval_f1" not found` error

* forgot to upd `from_context`

* try to fix mypy

* try to fix "not found f1" error

* refactor a little bit

* disable early stopping for lora

* fix typing errors

* update contributing and makefile

* minor change

* use our metrics

* add docstrings

* set 3.10 for mypy

* upd contributing.md

* try to fix bug

* try to fix typing issue

* try to fix

* add early stopping to ptuning
* add test for configuration

* lint

* satisfy mypy
* add prompt logging

* Update optimizer_config.schema.json

* fix

---------

Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
* fix default prompt

* allow to use default prompt with override
@voorhs merged commit ac1b732 into dev on Jun 18, 2025
21 of 22 checks passed
@voorhs deleted the feat/catboost_scorer branch on June 18, 2025 at 16:56