Skip to content

BUG: linear model doens't work on out-of-sample data #7560

@MatthiVH

Description

@MatthiVH

Describe the issue:

I want to fit a linear model to data using Bayesian inference technique. After training a model, I want to test its performance on new data and that's where the problem occurs. I don't seem to be able to set a new dataset and run the model on that one.

Update: I can set a new dataset, but only if it has the same length as the original one. So anything other than a 50-50 split doesn't work for the data or a workaround to make an 80-20 split into a 50-50 split.
So in the script below that contains the error, you'll see that when you set 160 points or make a 50-50 split it'll work, but not by setting any other test shape.

Reproduceable code example:

import numpy as np
import pymc as pm
import arviz as az
import matplotlib.pyplot as plt

def run_model():
    # Generate synthetic data
    np.random.seed(42)
    x = np.linspace(0, 10, 100)    
    a_true = 2.5  # True slope
    b_true = 1.0  # True intercept
    y_true = a_true * x + b_true
    y = y_true + np.random.normal(0, 1, size=x.size)  # Add some noise

    # Split into training and test sets
    x_train, x_test = x[:80], x[80:]
    y_train, y_test = y[:80], y[80:]

    print(x_train.shape)
    print(y_train.shape)

    # Define and fit the model
    with pm.Model() as linear_model:
        # Define x as a pm.Data variable to allow updating with pm.set_data
        x_shared = pm.Data("x", x_train)

        # Priors for slope and intercept
        a = pm.Normal("a", mu=0, sigma=10)
        b = pm.Normal("b", mu=0, sigma=10)
        sigma = pm.HalfNormal("sigma", sigma=1)

        # Expected value of y
        mu = a * x_shared + b

        # Likelihood
        y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_train)

        # Sample from the posterior
        trace = pm.sample(1000, tune=1000, return_inferencedata=True, chains=1)

    # Predict on training data
    with linear_model:
        pm.set_data({"x": x_train})  # Update data to training
        post_pred_train = pm.sample_posterior_predictive(trace).posterior_predictive
        y_pred_samples_train = post_pred_train["y_obs"]
        y_hat_train = y_pred_samples_train.mean(dim='draw').mean(dim='chain')

    # Predict on test data
    with linear_model:
        pm.set_data({"x": x_test})  # Update data to testing
        post_pred_test = pm.sample_posterior_predictive(trace).posterior_predictive
        y_pred_samples_test = post_pred_test["y_obs"]
        y_hat_test = y_pred_samples_test.mean(dim='draw').mean(dim='chain')


# Only execute if run as the main module
if __name__ == '__main__':
    run_model()

Error message:

Traceback (most recent call last):
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\compile\function\types.py", line 959, in __call__
    self.vm()
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\graph\op.py", line 524, in rval
    r = p(n, [x[0] for x in i], o)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\tensor\random\op.py", line 403, in perform
    smpl_val = self.rng_fn(rng, *([*args, size]))
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\tensor\random\op.py", line 174, in rng_fn
    return getattr(rng, self.name)(*args, **kwargs)
  File "_generator.pyx", line 1146, in numpy.random._generator.Generator.normal
  File "_common.pyx", line 600, in numpy.random._common.cont
  File "_common.pyx", line 517, in numpy.random._common.cont_broadcast_2
  File "__init__.pxd", line 741, in numpy.PyArray_MultiIterNew3
ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (80,) and arg 1 with shape (20,).
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\IPython\core\interactiveshell.py", line 3548, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-668474bcfdc0>", line 1, in <module>
    runfile('C:\\Users\\mvaho\\Documents\\0_Aalborg_data\\test_pymc.py', wdir='C:\\Users\\mvaho\\Documents\\0_Aalborg_data')
  File "C:\Program Files\JetBrains\PyCharm2023.3.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm2023.3.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:\Users\mvaho\Documents\0_Aalborg_data\test_pymc.py", line 86, in <module>
    run_model()
  File "C:\Users\mvaho\Documents\0_Aalborg_data\test_pymc.py", line 46, in run_model
    post_pred_test = pm.sample_posterior_predictive(trace)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pymc\sampling\forward.py", line 885, in sample_posterior_predictive
    values = sampler_fn(**param)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pymc\util.py", line 395, in wrapped
    return core_function(**input_point)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\compile\function\types.py", line 972, in __call__
    raise_with_op(
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\link\utils.py", line 524, in raise_with_op
    raise exc_value.with_traceback(exc_trace)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\compile\function\types.py", line 959, in __call__
    self.vm()
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\graph\op.py", line 524, in rval
    r = p(n, [x[0] for x in i], o)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\tensor\random\op.py", line 403, in perform
    smpl_val = self.rng_fn(rng, *([*args, size]))
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pytensor\tensor\random\op.py", line 174, in rng_fn
    return getattr(rng, self.name)(*args, **kwargs)
  File "_generator.pyx", line 1146, in numpy.random._generator.Generator.normal
  File "_common.pyx", line 600, in numpy.random._common.cont
  File "_common.pyx", line 517, in numpy.random._common.cont_broadcast_2
  File "__init__.pxd", line 741, in numpy.PyArray_MultiIterNew3
ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (80,) and arg 1 with shape (20,).
Apply node that caused the error: normal_rv{"(),()->()"}(RNG(<Generator(PCG64) at 0x1F23323B5A0>), [80], Composite{((i0 * i1) + i2)}.0, ExpandDims{axis=0}.0)
Toposort index: 4
Inputs types: [RandomGeneratorType, TensorType(int64, shape=(1,)), TensorType(float64, shape=(None,)), TensorType(float64, shape=(1,))]
Inputs shapes: ['No shapes', (1,), (20,), (1,)]
Inputs strides: ['No strides', (8,), (8,), (8,)]
Inputs values: [Generator(PCG64) at 0x1F23323B5A0, array([80], dtype=int64), 'not shown', array([0.97974278])]
Outputs clients: [[output[1](normal_rv{"(),()->()"}.0)], [output[0](y_obs)]]
Backtrace when the node is created (use PyTensor flag traceback__limit=N to make it longer):
  File "<ipython-input-2-668474bcfdc0>", line 1, in <module>
    runfile('C:\\Users\\mvaho\\Documents\\0_Aalborg_data\\test_pymc.py', wdir='C:\\Users\\mvaho\\Documents\\0_Aalborg_data')
  File "C:\Program Files\JetBrains\PyCharm2023.3.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm2023.3.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "C:\Users\mvaho\Documents\0_Aalborg_data\test_pymc.py", line 86, in <module>
    run_model()
  File "C:\Users\mvaho\Documents\0_Aalborg_data\test_pymc.py", line 33, in run_model
    y_obs = pm.Normal("y_obs", mu=mu, sigma=sigma, observed=y_train)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pymc\distributions\distribution.py", line 508, in __new__
    rv_out = cls.dist(*args, **kwargs)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pymc\distributions\continuous.py", line 508, in dist
    return super().dist([mu, sigma], **kwargs)
  File "C:\Users\mvaho\AppData\Roaming\Python\Python310\site-packages\pymc\distributions\distribution.py", line 577, in dist
    rv_out = cls.rv_op(*dist_params, size=create_size, **kwargs)
HINT: Use the PyTensor flag `exception_verbosity=high` for a debug print-out and storage map footprint of this Apply node.

PyMC version information:

pymc 5.17.0
az 0.20.0
Windows 11

pymc install with pip, my pymc works, just not for pm.set_data()

Context for the issue:

When fitting models to data, the procedure always requires to split data in at least a training and a test dataset to test the performance of the actual model. Some even say that one needs a training, an evaluation and a test set. For this procedure to work, one needs to be able to run the model on the out-of-sample test data. This doesn't seem to be possible and if it isn't possible, one can't verify the actual performance of the model in a statistically sound manner.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions