Train system and general refactoring #230

Merged

ianbulovic merged 61 commits into Machine-Learning-for-Medical-Language:main on Jul 21, 2025
Conversation

etgld reviewed May 5, 2025
src/cnlpt/new_data/preprocess.py (Outdated)

```python
        return [
            tokenized_input.word_ids(i) for i in range(len(tokenized_input.input_ids))
        ]
    elif character_level:
```
Member
I wrote all the character level code here and elsewhere for some experiments using Google's CANINE model on one of Guergana's projects. If we want to keep the code there's some cleaning up I can do, although I haven't used CANINE in a while personally
etgld approved these changes May 5, 2025
etgld (Member) left a comment
Lots of great clarity and efficiency improvements! I really liked the refactoring of some of the functionality from train_system into callbacks; those examples are really helpful.

I'm not sure if/when I'll get the chance to test any of this out, but from what I can tell, all of the functionality I typically use should still work.
etgld approved these changes Jul 1, 2025
ianbulovic merged commit 1794a35 into Machine-Learning-for-Medical-Language:main (13 checks passed)
This is an attempt at refactoring the messier parts of the codebase. Quite a few changes; summary below. All the pre-refactoring code is still available in the `cnlpt.legacy` package.

Refactored train system
The refactored train system lives in the `cnlpt.train_system` package. With the new setup, you can initialize the train system by creating a `CnlpTrainSystem` instance. To run the new train system, use `cnlpt train [ARGS]`.

Initialization and training
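The paragraph below describes constructing a `CnlpTrainSystem` directly from the three argument groups, or via classmethods that build it from `argv`, a config dictionary, or a JSON file. As a self-contained illustration of that classmethod-constructor pattern (a toy class; the names `ToyTrainSystem`, `from_config`, and `from_json` are assumptions for this sketch, not cnlpt's actual API):

```python
# Toy illustration of the classmethod-constructor pattern described below.
# The class and method names here are hypothetical, not cnlpt's actual API.
import json
from dataclasses import dataclass, field


@dataclass
class ToyTrainSystem:
    model_args: dict = field(default_factory=dict)
    data_args: dict = field(default_factory=dict)
    training_args: dict = field(default_factory=dict)

    @classmethod
    def from_config(cls, config: dict) -> "ToyTrainSystem":
        # Split one flat config dict into the three argument groups.
        return cls(
            model_args=config.get("model", {}),
            data_args=config.get("data", {}),
            training_args=config.get("training", {}),
        )

    @classmethod
    def from_json(cls, path: str) -> "ToyTrainSystem":
        # A JSON-file constructor just parses the file and delegates.
        with open(path) as f:
            return cls.from_config(json.load(f))


system = ToyTrainSystem.from_config(
    {"model": {"encoder": "bert"}, "training": {"epochs": 3}}
)
```

The appeal of this pattern is that every entry point (`argv`, dict, JSON file) funnels into one validated `__init__`, so argument checking lives in a single place.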
A `CnlpTrainSystem` is created from model arguments, data arguments, and training arguments. Classmethods are also available to initialize a `CnlpTrainSystem` from `argv`, a config dictionary, or a JSON file.

The `__init__` method of `CnlpTrainSystem` configures logging (more info below) and validates the provided args, then sets up the tokenizer, dataset, and model for training. Training won't actually start until the `train()` method is called.

Metrics and model saving
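The metric names listed in this section follow a simple scheme: `avg_*` names aggregate across tasks, while dotted names like `TASKNAME.macro_f1` scope a metric to one task. A toy sketch of how such names could be resolved against per-task results (illustration only, not cnlpt's implementation; `resolve_metric` is a hypothetical name):

```python
# Illustrative sketch: resolving selection-metric names like "avg_acc" or
# "negation.macro_f1" against a {task: {metric: value}} mapping.
# This is not cnlpt's actual implementation, just the naming scheme.

def resolve_metric(name: str, task_metrics: dict) -> float:
    """Resolve a selection-metric name against per-task metric values."""
    if name.startswith("avg_"):
        # Aggregate metrics average the per-task values, e.g. avg_acc.
        key = name[len("avg_"):]
        values = [m[key] for m in task_metrics.values()]
        return sum(values) / len(values)
    # Task-scoped metrics use dotted names, e.g. TASKNAME.macro_f1.
    task, metric = name.split(".", 1)
    return task_metrics[task][metric]


metrics = {
    "negation": {"acc": 0.90, "macro_f1": 0.85},
    "dtr": {"acc": 0.80, "macro_f1": 0.75},
}
avg = resolve_metric("avg_acc", metrics)
single = resolve_metric("negation.macro_f1", metrics)
```

A flat naming scheme like this is what lets a single `--metric_for_best_model` string select either an aggregate or a per-task score.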
The `model_selection_score` and `model_selection_label` training arguments have been removed in favor of `Trainer`'s built-in system for saving the best model. Use the training argument `--metric_for_best_model` to choose your selection metric. It defaults to average accuracy across all tasks, but other options are available:

- `loss`
- `avg_acc` (default)
- `avg_macro_f1`
- `avg_micro_f1`
- `TASKNAME.acc`
- `TASKNAME.macro_f1`
- `TASKNAME.micro_f1`
- `TASKNAME.LABELNAME.f1`
- `METRIC_1, METRIC_2, ... METRIC_N`

Reworked predictions and analysis system
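This section describes a `CnlpPredictions` dataclass with JSON serialization. As a self-contained sketch of that shape (toy class; the field names `task_names`, `predictions`, and `labels` are assumptions for illustration, not the actual fields):

```python
# Toy sketch of a predictions container with JSON round-tripping, in the
# spirit of the CnlpPredictions dataclass described below. Field names
# here are hypothetical.
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ToyPredictions:
    task_names: list                      # one entry per task
    predictions: list                     # predicted label ids, per task
    labels: Optional[list] = None         # gold labels; None for unlabeled test data

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, s: str) -> "ToyPredictions":
        return cls(**json.loads(s))


preds = ToyPredictions(task_names=["negation"], predictions=[[0, 1, 1]])
restored = ToyPredictions.from_json(preds.to_json())
```

Serializing to plain JSON keeps the saved `predictions.json` readable by downstream tools that know nothing about the training code.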
This PR introduces a `CnlpPredictions` dataclass (in the `data` package) that stores information related to predictions made by the model on test data (with or without labels). These predictions can be generated with the `predict()` method of `CnlpTrainSystem`, and the dataclass has methods for JSON serialization. Using the `--do_predict` flag when training will automatically run predictions on the test set when training is complete and save them to a `predictions.json` file in your output directory.

There is also a new `cnlpt.data.analysis` module with a function that can convert a `CnlpPredictions` instance to a polars dataframe for analysis.

Logging and live display
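The idea described in this section, routing progress messages to a logfile so the console stays free for a live display, can be sketched with the standard library alone (the filename `train_system.log` comes from the PR; the logger name and setup details here are illustrative):

```python
# Minimal stdlib sketch: send training-progress messages to a logfile
# (train_system.log, per the PR) rather than stdout/stderr, leaving the
# console free for a live display. Setup details are illustrative.
import logging
import os
import tempfile

output_dir = tempfile.mkdtemp()  # stand-in for the configured output directory
log_path = os.path.join(output_dir, "train_system.log")

logger = logging.getLogger("train_system")
logger.setLevel(logging.INFO)
logger.propagate = False  # keep messages out of the root (console) logger

handler = logging.FileHandler(log_path)
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)

logger.info("epoch 1 complete")
handler.flush()
```

With messages diverted like this, a library such as `rich` can own the terminal for a live-updating progress view without log lines interleaving into it.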
Rather than relying on stdout/stderr to document the training process, all relevant information is now logged to `train_system.log` in the configured output directory.

By moving everything to the logfile, we can reclaim console real estate for a much more interpretable live training progress display using `rich`.

Refactored data processing
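The refactor described in this section centers on a `TaskInfo` dataclass that bundles per-task properties. A toy sketch of that shape, with field names assumed from the properties the PR text lists (task type, number of labels, label set, task index), not taken from the actual class:

```python
# Toy sketch of the per-task packaging described below: instead of several
# parallel dicts keyed by task name, bundle each task's properties into one
# dataclass. Field names are assumptions based on the PR text.
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyTaskInfo:
    name: str
    task_type: str          # e.g. "classification" or "tagging"
    labels: tuple           # the task's label set
    index: int              # position of this task among the model's outputs

    @property
    def num_labels(self) -> int:
        # Derived from the label set, so it can never drift out of sync
        # with a separately maintained count.
        return len(self.labels)


task = ToyTaskInfo(
    name="negation", task_type="classification", labels=("-1", "1"), index=0
)
```

Passing one object per task replaces the "dict of task types, dict of label sets, dict of indices" pattern, and derived values like `num_labels` stay consistent by construction.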
rich.Refactored data processing
Most of the data processing code has also been refactored (i.e.,
cnlp_processorsandcnlp_data). The new code, which is used by the new train system, lives in thecnlpt.datapackage.The main goal of the data refactoring was to simplify a lot of code by packaging all info related to each task into a new dataclass,
TaskInfo. Basically all of our data processing before required passing around a bunch of dicts mapping task names to different properties (task type, number of labels, label set, task index). Repackaging all that data on a per-task basis simplifies quite a lot.Other stuff
Miscellaneous tooling changes involving the `src` layout, `uv`, and `setuptools-scm` (conflicts have been resolved in newer `uv` versions).

TODO
When reworking the train system, I noticed that the old code only successfully sets the class weights for the CNN model; for the Hierarchical and CNLP models the class weights are taken from `dataset.class_weights`, which (as far as I can tell) is always `None`. This is most likely a bug in the original train system, but since I didn't write that code I'll wait for review before fixing it in case I'm missing something.

One chunk of code that's still missing in this refactor is the error and disagreement analysis stuff in `cnlp_predict.py`. I don't have a good sense of how much of that code is still needed now that we can export a dataframe with much of the same information from the new `CnlpPredictions` dataclass via the `make_preds_df` function in `cnlpt.data.analysis`.

Although training seems to run the same with the arguments I've tried, it's possible I accidentally broke something for someone else's use case. I'm opening this PR early as a draft so that people can test it on their own tasks/data to make sure everything is still working fine. As a reminder, run the new train system with `cnlpt train [ARGS]`.