Conversation

@ArneRobben (Collaborator)

What does this change?

This PR adds a few things:

  • I renamed preprocess.py to train_test_split.py. Most of the preprocessing is now done in src/data_processing/*, and the main utility of the old preprocess.py was just to split the data into train and test sets
  • a separate metrics.py file where all metrics are calculated
  • we now calculate the following metrics:
    • (as before) the usual 'lowest level' HRCS RAC tags, with metrics both at an aggregate level (such as macro and micro F1) and at a per-label level
    • we also calculate parent-level metrics, where we take the first digit of a lowest-level classification to infer the parent, e.g. "1.1: Normal biological development and functioning (underpinning)" maps to "1: Underpinning research" (see the sketch after this list)
    • to get a sense of how the metrics do on Wellcome's grants, we also calculate metrics with the test data filtered to Wellcome grants only. To support this, train_meta.parquet and test_meta.parquet files are created containing the FundingOrganisation field, so the metrics code can filter the test set to Wellcome-funded grants. These Wellcome-only metrics are likely to be a little worse because some classes are underrepresented, but Wellcome is still the second largest funding organisation in the data. We also report parent-level and super-parent-level Wellcome-only metrics
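
A minimal sketch of the parent inference and the Wellcome filter described above (the function name is illustrative, not necessarily what the PR uses):

import pandas as pd

def parent_code(tag: str) -> str:
    # "1.1: Normal biological development and functioning (underpinning)" -> "1"
    code = tag.split(":", 1)[0].strip()
    return code.split(".", 1)[0]

# Filter the evaluation rows to Wellcome grants via the meta file.
test_meta = pd.read_parquet("data/preprocessed/test_meta.parquet")
wellcome_mask = (test_meta["FundingOrganisation"] == "Wellcome Trust").to_numpy()
# y_true[wellcome_mask] and y_pred[wellcome_mask] then feed the usual metric functions.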

How to test

Run make train_test_split and then make train_ra

How can we measure success?

Success in this case is if the code does what it needs to do and the metrics we care about (listed above) are logged in wandb

Have we considered potential risks?

The classifier's metrics might look worse on the "Wellcome only" subset of the test set. The onus of explaining this is on us

@ArneRobben ArneRobben requested a review from justinbt1 January 13, 2026 10:56
@justinbt1 justinbt1 linked an issue Jan 14, 2026 that may be closed by this pull request
@justinbt1 (Collaborator) left a comment

Quick review as I know you are still working on this; everything appears to be working correctly. However, there are a few bits where the code might need cleaning up, see the comments below.

train_meta = pd.DataFrame(index=np.arange(len(train)))
test_meta = pd.DataFrame(index=np.arange(len(test)))

return train, test, train_meta, test_meta

This function is very long and does quite a few different tasks; could it be split into separate functions?
See https://ds.wellcome.data/python_standards/#functions for details and reasoning.
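
For illustration, one possible decomposition (a sketch only; the helper names and exact responsibilities are hypothetical):

import pandas as pd

def shuffle_and_split(df: pd.DataFrame, test_size: float):
    # Shuffle deterministically, then cut into train and test frames.
    df = df.sample(frac=1, random_state=10).reset_index(drop=True)
    cut = int(len(df) * (1 - test_size))
    return df.iloc[:cut], df.iloc[cut:]

def build_meta(frame: pd.DataFrame) -> pd.DataFrame:
    # Keep only the fields needed later for filtering metrics.
    return frame[["FundingOrganisation"]].reset_index(drop=True)

def make_datasets(df: pd.DataFrame, test_size: float):
    train, test = shuffle_and_split(df, test_size)
    return train, test, build_meta(train), build_meta(test)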

df = df.sample(frac=1, random_state=10).reset_index(drop=True)
df["_row_id"] = np.arange(len(df), dtype=np.int64)

# Labels (multilabel binarized)

I think this may be obvious from the class and variable names below; consider removing this comment?
Think of comments as simply another line of code we have to maintain.
See https://ds.wellcome.data/python_standards/#comments


    return all_metrics

return compute_metrics

This function is very long and will need to be refactored; more details on function sizing and why it's important can be found in the Python Language Rules section of our code standards.

    }
    return metrics

return compute_metrics

Moving metric calculation to another module is a really good idea!

python src/train_test_split.py \
    "config/train_config.yaml" \
    "data/clean/clean.parquet" \
    "data/preprocessed"

This target name change also needs to be reflected in the associated documentation.

wandb.log({"wellcome_metrics_status": "funding_org_missing"})
elif len(meta_df) != labels.shape[0]:
print("Info: test_meta length does not match eval set; skipping Wellcome metrics.")
wandb.log({"wellcome_metrics_status": "length_mismatch"})

Out of interest, why do we have so many edge cases?

num_tags_predicted = []
if config["training_settings"]["output_weighting"]:
    thresholds = [1] * labels.shape[1]
    output_thresholds = config["training_settings"].get("output_thresholds", [0.2, 0.5, 0.8, 0.95])

See above comment on line width.
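
For example, the quoted line could be wrapped as:

output_thresholds = config["training_settings"].get(
    "output_thresholds", [0.2, 0.5, 0.8, 0.95]
)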

wandb.log({"wellcome_metrics_status": "computed"})
else:
print("Info: No rows for FundingOrganisation == 'Wellcome Trust' in eval set")
wandb.log({"wellcome_metrics_status": "no_wellcome_rows"})

Could the content of this try/except block be moved to another function? This would also avoid the need to nest a large block of code.
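
A sketch of that refactor, using early returns to avoid the nesting (compute_metrics stands in for whatever metric function metrics.py exposes):

import wandb

def log_wellcome_metrics(meta_df, labels, preds):
    # Each guard logs why the Wellcome-only metrics were skipped, then bails out early.
    if "FundingOrganisation" not in meta_df.columns:
        wandb.log({"wellcome_metrics_status": "funding_org_missing"})
        return
    if len(meta_df) != labels.shape[0]:
        print("Info: test_meta length does not match eval set; skipping Wellcome metrics.")
        wandb.log({"wellcome_metrics_status": "length_mismatch"})
        return
    mask = (meta_df["FundingOrganisation"] == "Wellcome Trust").to_numpy()
    if not mask.any():
        print("Info: No rows for FundingOrganisation == 'Wellcome Trust' in eval set")
        wandb.log({"wellcome_metrics_status": "no_wellcome_rows"})
        return
    wandb.log(compute_metrics(labels[mask], preds[mask]))  # hypothetical metric call
    wandb.log({"wellcome_metrics_status": "computed"})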

train (pd.DataFrame): Training data.
test (pd.DataFrame): Test data.
output_dir (str): Output directory.
"""

The train_meta and test_meta parameters are missing from the docstring.
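
i.e. something like (descriptions of the new parameters are inferred, not taken from the PR):

Args:
    train (pd.DataFrame): Training data.
    test (pd.DataFrame): Test data.
    train_meta (pd.DataFrame): Metadata (e.g. FundingOrganisation) for the training rows.
    test_meta (pd.DataFrame): Metadata for the test rows.
    output_dir (str): Output directory.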


# Randomly shuffle dataframe and reset index; add a stable row id
df = df.sample(frac=1, random_state=10).reset_index(drop=True)
df["_row_id"] = np.arange(len(df), dtype=np.int64)

I might be missing something, but could the cleaned dataset's index be used as the values for the new stable row id column? This could allow easier merging with the original dataset for metric calculation, without needing to create intermediate meta datasets. It might reduce the number of datasets and the code needed to create, read and handle them.
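
A sketch of that suggestion (assuming the cleaned parquet's index survives into df; the path mirrors the CLI invocation quoted above):

import pandas as pd

# Carry the cleaned dataset's index as the stable row id instead of a fresh arange.
df = df.sample(frac=1, random_state=10)
df["_row_id"] = df.index.to_numpy()
df = df.reset_index(drop=True)

# At metric time, metadata can be recovered straight from the cleaned data,
# with no intermediate meta parquet files:
clean = pd.read_parquet("data/clean/clean.parquet")
funding_org = clean.loc[test["_row_id"], "FundingOrganisation"]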

Successfully merging this pull request may close these issues.

Add Additional Metrics