Add CP_MDA-exact algorithm for correct uncertainty estimates with missing data. #231

Fdvanleeuwen · 2026-01-02T10:43:01Z

The CP_MDA-exact algorithm [1] lets users obtain conformalised uncertainty estimates for TabPFNRegressor. The estimates are corrected separately for each missing-data mask, as described in [2]. The extension provides two classes:

CP_MDA_TabPFNRegressor: fits TabPFNRegressor and computes the correction term for each mask.
CP_MDA_TabPFNRegressor_newdata: applies the fitted model (and correction terms) to new test cases.

[1] Margaux Zaffran, Aymeric Dieuleveut, Julie Josse, and Yaniv Romano. Conformal prediction with missing values. In International Conference on Machine Learning, pages 40578–40604. PMLR, 2023.
[2] van Leeuwen, F. D. Conformal Prediction for Tabular Prior-Data Fitted Networks with Missing data. In EurIPS 2025 Workshop: AI for Tabular Data.

Introduces the CP_missing_data extension for TabPFNRegressor, providing conformal prediction intervals in the presence of missing data. Includes implementation, example usage, and tests for calibration and prediction with missing data patterns.
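To make the per-mask correction concrete, here is a minimal sketch of the calibration step under one reading of CP-MDA-Exact in [1]: for a given test mask, keep the calibration rows whose missing features are a subset of that mask, hide the extra features so the rows match the mask, score them, and take the finite-sample conformal quantile. This is not the code in this PR. All helper names (`mask_of`, `correction_for_mask`, `toy_interval`, and so on) are hypothetical, `toy_interval` is a stand-in for the lower and upper quantile predictions of TabPFNRegressor, and missing values are represented as `None` to keep the sketch free of dependencies.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Row = Sequence[Optional[float]]
Mask = Tuple[int, ...]


def mask_of(row: Row) -> Mask:
    """Return 0/1 flags per feature; 1 marks a missing (None) value."""
    return tuple(int(v is None) for v in row)


def is_subset(cal_mask: Mask, test_mask: Mask) -> bool:
    """True if every feature missing in cal_mask is also missing in test_mask."""
    return all(c <= t for c, t in zip(cal_mask, test_mask))


def apply_mask(row: Row, mask: Mask) -> List[Optional[float]]:
    """Hide the features flagged in mask so the row matches the test pattern."""
    return [None if m else v for v, m in zip(row, mask)]


def conformal_quantile(scores: Sequence[float], alpha: float) -> float:
    """The ceil((n + 1)(1 - alpha))-th smallest score, or inf if n is too small."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    return math.inf if k > n else sorted(scores)[k - 1]


def correction_for_mask(
    predict_interval: Callable[[Row], Tuple[float, float]],
    X_cal: Sequence[Row],
    y_cal: Sequence[float],
    test_mask: Mask,
    alpha: float,
) -> float:
    """Correction term for one test mask, using only compatible calibration rows."""
    scores = []
    for row, y in zip(X_cal, y_cal):
        if is_subset(mask_of(row), test_mask):
            lo, hi = predict_interval(apply_mask(row, test_mask))
            # Quantile-regression conformity score: positive when y is outside [lo, hi].
            scores.append(max(lo - y, y - hi))
    return conformal_quantile(scores, alpha)


def toy_interval(row: Row) -> Tuple[float, float]:
    """Stand-in predictor (assumption): mean of observed features, plus or minus 1."""
    observed = [v for v in row if v is not None]
    center = sum(observed) / len(observed) if observed else 0.0
    return center - 1.0, center + 1.0


X_cal = [[1.0, 2.0], [None, 4.0], [3.0, None]]
y_cal = [2.5, 6.0, 3.0]
test_row = [None, 5.0]

q = correction_for_mask(toy_interval, X_cal, y_cal, mask_of(test_row), alpha=0.5)
lo, hi = toy_interval(test_row)
interval = (lo - q, hi + q)
```

In this toy run the third calibration row is dropped because its second feature is missing while the test row observes it, so only two scores enter the quantile.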

Less features to reduce the number of masks for the example

chatgpt-codex-connector · 2026-01-02T10:43:05Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

CLAassistant · 2026-01-02T10:43:07Z

All committers have signed the CLA.

gemini-code-assist · 2026-01-02T10:43:16Z

Summary of Changes

Hello @Fdvanleeuwen, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the CP_MDA-exact algorithm into the tabpfn_extensions library, enabling TabPFNRegressor to generate accurate and calibrated uncertainty estimates even in the presence of missing data. The solution provides a robust framework for handling various missing data patterns by calculating specific correction terms, thereby improving the reliability of predictions for tabular data.

Highlights

New Algorithm for Missing Data: Introduces the CP_MDA-exact algorithm to provide conformalized uncertainty estimates for TabPFNRegressor when data is missing.
Two-Step Process: Implements two core functions: CP_MDA_TabPFNRegressor for model fitting and correction term calculation, and CP_MDA_TabPFNRegressor_newdata for applying these corrections to new datasets.
Research-Backed: The algorithm is based on recent research, specifically referencing papers by Zaffran et al. (2023) and van Leeuwen (EurIPS 2025).

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces the CP_MDA-exact algorithm for handling missing data with TabPFN, which is a valuable addition. The implementation is well-structured into two classes for training and prediction. I've found a critical issue in the data selection logic that could lead to incorrect calibration, which needs to be addressed. Additionally, I've provided feedback on improving docstrings, code style consistency, and the flexibility of the implementation. The tests are a good start, but one of them could be made more robust. The example script is clear and helpful.

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

examples/CP_missing_data/CP_missing_data_example.py

gemini-code-assist · 2026-01-02T10:45:30Z

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

+    def run_TABPFN(self):
+        """Fit the TabPFN model."""
+        # fit model
+        m_fit = TabPFNRegressor()


The TabPFNRegressor is instantiated with its default parameters, which limits flexibility. Consider allowing users to pass custom arguments to TabPFNRegressor through the CP_MDA_TabPFNRegressor constructor.

You could modify the __init__ method to accept **tabpfn_kwargs and store them:

def __init__(self, X_train, Y_train, quantiles, val_size, seed, **tabpfn_kwargs):
    # ...
    self.tabpfn_kwargs = tabpfn_kwargs

Then, you can use these arguments when creating the instance here.

Suggested change

-        m_fit = TabPFNRegressor()
+        m_fit = TabPFNRegressor(**getattr(self, 'tabpfn_kwargs', {}))
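A minimal sketch of how the keyword pass-through suggested above could look end to end. `StubRegressor` is a hypothetical stand-in so the snippet runs without TabPFN installed, and `CalibratorSketch`, `build_model`, and the `regressor_cls` parameter are illustrative names that do not appear in the PR. The `n_estimators` argument in the usage lines is only an example of a keyword a caller might forward.

```python
from typing import Any, Dict, Optional, Sequence


class StubRegressor:
    """Stand-in for TabPFNRegressor (assumption): it only records its kwargs."""

    def __init__(self, **kwargs: Any) -> None:
        self.kwargs: Dict[str, Any] = kwargs


class CalibratorSketch:
    """Hypothetical wrapper that forwards extra keyword arguments to the model."""

    def __init__(
        self,
        quantiles: Sequence[float],
        val_size: float,
        seed: Optional[int] = None,
        regressor_cls: type = StubRegressor,
        **tabpfn_kwargs: Any,
    ) -> None:
        self.quantiles = quantiles
        self.val_size = val_size
        self.seed = seed
        self.regressor_cls = regressor_cls
        # Always stored, so later code does not need a getattr fallback.
        self.tabpfn_kwargs = tabpfn_kwargs

    def build_model(self) -> Any:
        """Create the underlying regressor with the user-supplied arguments."""
        return self.regressor_cls(**self.tabpfn_kwargs)


sketch = CalibratorSketch([0.05, 0.95], val_size=0.5, n_estimators=4)
model = sketch.build_model()
```

Storing `tabpfn_kwargs` unconditionally in the constructor means the `getattr(self, 'tabpfn_kwargs', {})` fallback is only needed if some code path can skip `__init__`.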

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

src/tabpfn_extensions/CP_missing_data/__init__.py

tests/test_CP_missing_data.py

bejaeger

Hi @Fdvanleeuwen,
thank you so much for this contribution. All in all, it looks great. I left comments on a few stylistic points; it would be great if you could address them. The CI also shows some errors. Once those are fixed, I can take a closer look at the method.
Thanks again!

bejaeger · 2026-01-07T12:09:37Z

examples/CP_missing_data/CP_missing_data_example.py

+    from tabpfn import TabPFNRegressor
+except ImportError:
+    # Fall back to TabPFN client
+    from tabpfn_client import  TabPFNRegressor


You can use from tabpfn_extensions.utils import TabPFNClassifier, TabPFNRegressor instead :)

bejaeger · 2026-01-07T12:10:00Z

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

+    from tabpfn import TabPFNRegressor
+except ImportError:
+    # Fall back to TabPFN client
+    from tabpfn_client import  TabPFNRegressor


Same here: from tabpfn_extensions.utils import TabPFNClassifier, TabPFNRegressor

bejaeger · 2026-01-07T12:10:29Z

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

+    Parameters:
+        X_train : matrix-like of shape (n_samples, n_predictors)
+
+        Y_train : array-like of continuous outcome with shape (n_samples,)


Let's stick to the convention of parsing the data during the fit() call. Only configuration should go here.
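A minimal sketch of the convention described above, assuming the usual scikit-learn style: the constructor stores configuration only, data arrives in `fit()`, and anything learned from the data gets a trailing underscore. `MaskCalibratorSketch` and its attributes are hypothetical names, and the body of `fit()` only records the distinct missingness masks rather than fitting TabPFN.

```python
from typing import List, Optional, Sequence, Tuple


class MaskCalibratorSketch:
    """Hypothetical estimator: configuration in __init__, data in fit()."""

    def __init__(
        self,
        quantiles: Sequence[float] = (0.05, 0.95),
        val_size: float = 0.5,
        seed: Optional[int] = None,
    ) -> None:
        # Only configuration is stored here; no data is touched or validated.
        self.quantiles = quantiles
        self.val_size = val_size
        self.seed = seed

    def fit(
        self,
        X: Sequence[Sequence[Optional[float]]],
        y: Sequence[float],
    ) -> "MaskCalibratorSketch":
        """Receive the data here and store learned state with a trailing underscore."""
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of rows")
        self.n_features_in_: int = len(X[0])
        # Distinct missingness patterns seen in the data (1 marks a missing value).
        self.masks_: List[Tuple[int, ...]] = sorted(
            {tuple(int(v is None) for v in row) for row in X}
        )
        return self


est = MaskCalibratorSketch(val_size=0.3).fit(
    [[1.0, None], [2.0, 3.0], [4.0, None]],
    [1.0, 2.0, 3.0],
)
```

Keeping data out of the constructor lets the same configured object be refitted on different data.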

bejaeger · 2026-01-07T12:11:36Z

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

+        """Split data into training and validation sets."""
+        # create df with missing data indicator
+        missing_bool_df = self.X.isnull().astype(int)
+        self.X_train, self.X_val, Y_train_arr, Y_val_arr, self.Mask_train, self.Mask_val = train_test_split(


Our convention is to use lowercase y for the target variable.
Please also make Mask_train/val lowercase (mask_train/val) to stick to Python naming conventions.

bejaeger · 2026-01-07T12:12:19Z

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

+        self.split_data()
+        self.run_TABPFN()
+        self.mask_preprocess()
+        mask_unique, model = self.create_calibration_sets()


The code would be easier to follow if each step returned the variables needed by the next step rather than storing them as instance variables. That would be a nice change.
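A minimal sketch of that data flow, with each step returning what the next one needs. The function names and bodies are hypothetical placeholders: `fit_model` stands in for fitting TabPFNRegressor by returning the mean of the targets, and `calibration_scores` ignores `X_val` in this toy version.

```python
from typing import List, Sequence, Tuple

Rows = Sequence[Sequence[float]]


def split_data(
    X: Rows, y: Sequence[float], val_size: float
) -> Tuple[Rows, Rows, Sequence[float], Sequence[float]]:
    """Hold out the last rows for calibration (no shuffling in this sketch)."""
    n_val = max(1, int(round(len(X) * val_size)))
    return X[:-n_val], X[-n_val:], y[:-n_val], y[-n_val:]


def fit_model(X_train: Rows, y_train: Sequence[float]) -> float:
    """Stand-in for fitting a regressor: the 'model' is the mean target."""
    return sum(y_train) / len(y_train)


def calibration_scores(
    model: float, X_val: Rows, y_val: Sequence[float]
) -> List[float]:
    """Absolute residuals on the held-out rows (X_val is unused by the toy model)."""
    return [abs(yi - model) for yi in y_val]


def fit(X: Rows, y: Sequence[float], val_size: float) -> Tuple[float, List[float]]:
    """Each step passes its outputs on explicitly instead of mutating self."""
    X_train, X_val, y_train, y_val = split_data(X, y, val_size)
    model = fit_model(X_train, y_train)
    scores = calibration_scores(model, X_val, y_val)
    return model, scores


result = fit([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 9.0], val_size=0.25)
```

With explicit inputs and outputs, the dependency between steps is visible in `fit()` itself, and each step can be tested in isolation.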

bejaeger · 2026-01-07T12:12:35Z

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

+
+    TabPFN: Fitted TabPFNRegressor model.
+
+    X_new : matrix-like of shape (n_samples, n_predictors)


Same here, let's parse X_new during fit()

bejaeger · 2026-01-07T12:13:17Z

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

+        """Convenience method to run the entire pipeline"""
+        self.obtain_preds()
+        self.match_mask()
+        CP_results =  self.perform_correction()


Similar here: could we return the values and pass them on as arguments where they are needed?

bejaeger · 2026-01-07T12:13:40Z

src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

+
+    """
+
+    def __init__(self, X_train, Y_train, quantiles, val_size, seed=None):


Please add type hints here and in all other functions.
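A minimal sketch of what the annotated signature might look like. The concrete types are assumptions: `pd.DataFrame` is inferred from the `.isnull()` call in the diff, `pd.Series` for the target is a guess, and `CPMDASignatureSketch` is a hypothetical name. The pandas annotations are written as strings and imported only under `TYPE_CHECKING`, so the snippet runs even where pandas is not installed.

```python
from typing import TYPE_CHECKING, Optional, Sequence

if TYPE_CHECKING:
    # Imported only for static type checkers; not needed at runtime.
    import pandas as pd


class CPMDASignatureSketch:
    """Hypothetical class showing an annotated version of the signature in the diff."""

    def __init__(
        self,
        X_train: "pd.DataFrame",
        Y_train: "pd.Series",
        quantiles: Sequence[float],
        val_size: float,
        seed: Optional[int] = None,
    ) -> None:
        self.X_train = X_train
        self.Y_train = Y_train
        self.quantiles = quantiles
        self.val_size = val_size
        self.seed = seed


annotations = CPMDASignatureSketch.__init__.__annotations__
```

The sketch keeps the parameter names from the diff; if the data arguments move to `fit()`, the same annotations would move with them.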

Fdvanleeuwen added 4 commits December 31, 2025 11:32

Import TabPFN in a flexible way

840ca59

Update example

baf8528

Less features to reduce the number of masks for the example

update the correct scoring rule

de6575f

Fdvanleeuwen requested a review from a team as a code owner January 2, 2026 10:43

Fdvanleeuwen requested review from adrian-prior and removed request for a team January 2, 2026 10:43

gemini-code-assist bot reviewed Jan 2, 2026

Fdvanleeuwen and others added 10 commits January 2, 2026 12:23

Update src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

7ba1129

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update src/tabpfn_extensions/CP_missing_data/__init__.py

2fdf17b

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

1f2ae55

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update examples/CP_missing_data/CP_missing_data_example.py

662a17d

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update src/tabpfn_extensions/CP_missing_data/CP_missing_data.py

4c0d9c6

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Update based on gemini-code-assist

73ef781

Update tests/test_CP_missing_data.py

57bcccc

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Tidy up the code

512cf47

Update CP_missing_data.py

07ee7e9

Make seed optional and update name change of internal function in pipeline.

Add the masking of nested columns

22b1685

adrian-prior requested review from bejaeger and removed request for adrian-prior January 7, 2026 10:29

bejaeger requested changes Jan 7, 2026
