Conversation

@ArneRobben (Collaborator)

What does this change?

This PR adds a few things:

  • I renamed preprocess.py to train_test_split.py. Most of the preprocessing is now done in src/data_processing/*, and the main utility of the old preprocess.py was just to split the data into train and test sets
  • a separate metrics.py file where all metrics are calculated
  • we now calculate the following metrics:
    • (as before) the usual 'lowest level' HRCS RAC tags, with metrics both at an aggregate level (such as macro and micro F1) and at a per-label level
    • we also calculate parent-level metrics, where we take the first digit of a lowest-level classification to infer the parent, e.g. "1.1: Normal biological development and functioning (underpinning)" maps to "1: Underpinning research" (see the sketch after this list)
    • to get a sense of how the metrics do on Wellcome's grants, we also calculate metrics with the test data filtered to Wellcome grants only. To support this, train_meta.parquet and test_meta.parquet files are created containing the FundingOrganisation field, so the metrics code can filter the test set to Wellcome-funded grants. These Wellcome-only metrics are likely to be a little worse because some classes are underrepresented, but Wellcome is still the second largest funding organisation in the data. We also report parent-level and super-parent-level Wellcome-only metrics
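
A minimal sketch of the parent inference and the Wellcome filter described above (the function name is illustrative, not necessarily what the PR uses):

import pandas as pd

def parent_code(tag: str) -> str:
    # "1.1: Normal biological development and functioning (underpinning)" -> "1"
    code = tag.split(":", 1)[0].strip()
    return code.split(".", 1)[0]

# Filter the evaluation rows to Wellcome grants via the meta file.
test_meta = pd.read_parquet("data/preprocessed/test_meta.parquet")
wellcome_mask = (test_meta["FundingOrganisation"] == "Wellcome Trust").to_numpy()
# y_true[wellcome_mask] and y_pred[wellcome_mask] then feed the usual metric functions.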

How to test

Run make train_test_split and then make train_ra

How can we measure success?

Success in this case is if the code does what it needs to do and the metrics we care about (listed above) are logged in wandb

Have we considered potential risks?

The classifier's metrics might look worse on the "Wellcome only" subset of the test set. The onus of explaining this is on us

@ArneRobben ArneRobben requested a review from justinbt1 January 13, 2026 10:56
@justinbt1 justinbt1 linked an issue Jan 14, 2026 that may be closed by this pull request
@justinbt1 (Collaborator) left a comment

Quick review as I know you are still working on this; everything appears to be working correctly. However, there are a few bits where the code might need cleaning up, see the comments below.

train_meta = pd.DataFrame(index=np.arange(len(train)))
test_meta = pd.DataFrame(index=np.arange(len(test)))

return train, test, train_meta, test_meta

This function is very long and does quite a few different tasks; could it be split into separate functions?
See https://ds.wellcome.data/python_standards/#functions for details and reasoning.
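
For illustration, one possible decomposition (a sketch only; the helper names and exact responsibilities are hypothetical):

import pandas as pd

def shuffle_and_split(df: pd.DataFrame, test_size: float):
    # Shuffle deterministically, then cut into train and test frames.
    df = df.sample(frac=1, random_state=10).reset_index(drop=True)
    cut = int(len(df) * (1 - test_size))
    return df.iloc[:cut], df.iloc[cut:]

def build_meta(frame: pd.DataFrame) -> pd.DataFrame:
    # Keep only the fields needed later for filtering metrics.
    return frame[["FundingOrganisation"]].reset_index(drop=True)

def make_datasets(df: pd.DataFrame, test_size: float):
    train, test = shuffle_and_split(df, test_size)
    return train, test, build_meta(train), build_meta(test)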

df = df.sample(frac=1, random_state=10).reset_index(drop=True)
df["_row_id"] = np.arange(len(df), dtype=np.int64)

# Labels (multilabel binarized)

I think this may be obvious from the class and variable names below; consider removing this comment?
Think of comments as simply another line of code we have to maintain.
See https://ds.wellcome.data/python_standards/#comments


    return all_metrics

return compute_metrics

This function is very long and will need to be refactored; more details on function sizing and why it's important can be found in the Python Language Rules section of our code standards.

    }
    return metrics

return compute_metrics

Moving metric calculation to another module is a really good idea!

python src/train_test_split.py \
    "config/train_config.yaml" \
    "data/clean/clean.parquet" \
    "data/preprocessed"

This target name change also needs to be reflected in the associated documentation.

wandb.log({"wellcome_metrics_status": "funding_org_missing"})
elif len(meta_df) != labels.shape[0]:
print("Info: test_meta length does not match eval set; skipping Wellcome metrics.")
wandb.log({"wellcome_metrics_status": "length_mismatch"})

Out of interest, why do we have so many edge cases?

num_tags_predicted = []
if config["training_settings"]["output_weighting"]:
    thresholds = [1] * labels.shape[1]
    output_thresholds = config["training_settings"].get("output_thresholds", [0.2, 0.5, 0.8, 0.95])

See above comment on line width.
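
For example, the quoted line could be wrapped as:

output_thresholds = config["training_settings"].get(
    "output_thresholds", [0.2, 0.5, 0.8, 0.95]
)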

wandb.log({"wellcome_metrics_status": "computed"})
else:
print("Info: No rows for FundingOrganisation == 'Wellcome Trust' in eval set")
wandb.log({"wellcome_metrics_status": "no_wellcome_rows"})

Could the content of this try/except block be moved to another function? This would also avoid the need to nest a large block of code.
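
A sketch of that refactor, using early returns to avoid the nesting (compute_metrics stands in for whatever metric function metrics.py exposes):

import wandb

def log_wellcome_metrics(meta_df, labels, preds):
    # Each guard logs why the Wellcome-only metrics were skipped, then bails out early.
    if "FundingOrganisation" not in meta_df.columns:
        wandb.log({"wellcome_metrics_status": "funding_org_missing"})
        return
    if len(meta_df) != labels.shape[0]:
        print("Info: test_meta length does not match eval set; skipping Wellcome metrics.")
        wandb.log({"wellcome_metrics_status": "length_mismatch"})
        return
    mask = (meta_df["FundingOrganisation"] == "Wellcome Trust").to_numpy()
    if not mask.any():
        print("Info: No rows for FundingOrganisation == 'Wellcome Trust' in eval set")
        wandb.log({"wellcome_metrics_status": "no_wellcome_rows"})
        return
    wandb.log(compute_metrics(labels[mask], preds[mask]))  # hypothetical metric call
    wandb.log({"wellcome_metrics_status": "computed"})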

train (pd.DataFrame): Training data.
test (pd.DataFrame): Test data.
output_dir (str): Output directory.
"""

The train_meta and test_meta parameters are missing from the docstring.
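
i.e. something like (descriptions of the new parameters are inferred, not taken from the PR):

Args:
    train (pd.DataFrame): Training data.
    test (pd.DataFrame): Test data.
    train_meta (pd.DataFrame): Metadata (e.g. FundingOrganisation) for the training rows.
    test_meta (pd.DataFrame): Metadata for the test rows.
    output_dir (str): Output directory.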


# Randomly shuffle dataframe and reset index; add a stable row id
df = df.sample(frac=1, random_state=10).reset_index(drop=True)
df["_row_id"] = np.arange(len(df), dtype=np.int64)

I might be missing something, but could the cleaned dataset's index be used as the values for the new stable row id column? This could allow easier merging with the original dataset for metric calculation, without needing to create intermediate meta datasets. It might reduce the number of datasets and the code needed to create, read and handle them.
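
A sketch of that suggestion (assuming the cleaned parquet's index survives into df; the path mirrors the CLI invocation quoted above):

import pandas as pd

# Carry the cleaned dataset's index as the stable row id instead of a fresh arange.
df = df.sample(frac=1, random_state=10)
df["_row_id"] = df.index.to_numpy()
df = df.reset_index(drop=True)

# At metric time, metadata can be recovered straight from the cleaned data,
# with no intermediate meta parquet files:
clean = pd.read_parquet("data/clean/clean.parquet")
funding_org = clean.loc[test["_row_id"], "FundingOrganisation"]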

Successfully merging this pull request may close these issues.

Add Additional Metrics