potential bug in prediction of MIXL models #14

@alvarogutyerrez

Dear Cristian Arteaga,
First of all, thank you for developing such a great package. I have been using it extensively and I think it works just great!

However, I think I found a weird behavior (most likely a bug) when predicting probabilities from a MIXL model.

I was trying to recreate the log-likelihood (LL) of the model from its predicted probabilities using equation (6) in your article:

[Image: equation (6) from the article]

Unfortunately, I was not able to replicate it, and the differences were rather large (see the first chunk of code below). Oddly enough, when using only 1 draw, I am able to replicate the LL value (see the second chunk of code below).

I haven't had time to go through your source code, but I suspect something is wrong with the .predict() method for MIXL models: for MNL models, I checked that the same method of recovering the LL works just fine (not included here).
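As a minimal sketch of the check described above, the recovered LL is the sum, over observations, of the log of the probability assigned to the chosen alternative. The probability matrix and choices below are made up for illustration and the `toy_*` names are hypothetical; nothing here comes from the fitted model.

```python
import numpy as np

# Toy probability matrix: 3 observations (rows) x 3 alternatives (columns).
# Values are invented for illustration; each row sums to 1.
toy_proba = np.array([[0.2, 0.5, 0.3],
                      [0.6, 0.3, 0.1],
                      [0.1, 0.1, 0.8]])

# Column index of the chosen alternative in each row (hypothetical choices).
toy_chosen = np.array([1, 0, 2]).reshape(-1, 1)

# Pick the chosen-alternative probability in each row, then sum the logs.
toy_proba_chosen = np.take_along_axis(toy_proba, toy_chosen, axis=1)
toy_ll = np.sum(np.log(toy_proba_chosen))
print("toy LL:", toy_ll)
```

The check only works if the rows of the probability matrix and the rows of the choice vector refer to the same observations in the same order, and if the column order matches the index mapping.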

First chunk

#%%
# 
import pandas as pd
import numpy as np
from xlogit.utils import wide_to_long
from xlogit import MixedLogit
# from your example
df_wide = pd.read_table("http://transp-or.epfl.ch/data/swissmetro.dat", sep='\t')
# Keep only observations for commute and business purposes that contain known choices
df_wide = df_wide[(df_wide['PURPOSE'].isin([1, 3]) & (df_wide['CHOICE'] != 0))]
df_wide['custom_id'] = np.arange(len(df_wide))  # Add unique identifier
df_wide['CHOICE'] = df_wide['CHOICE'].map({1: 'TRAIN', 2:'SM', 3: 'CAR'})
df = wide_to_long(df_wide, id_col='custom_id', alt_name='alt', sep='_',
                  alt_list=['TRAIN', 'SM', 'CAR'], empty_val=0,
                  varying=['TT', 'CO', 'HE', 'AV', 'SEATS'], alt_is_prefix=True)
df['ASC_TRAIN'] = np.ones(len(df))*(df['alt'] == 'TRAIN')
df['ASC_CAR'] = np.ones(len(df))*(df['alt'] == 'CAR')
df['TT'], df['CO'] = df['TT']/100, df['CO']/100  # Scale variables
annual_pass = (df['GA'] == 1) & (df['alt'].isin(['TRAIN', 'SM']))
df.loc[annual_pass, 'CO'] = 0  # Cost zero for pass holders
varnames=['ASC_CAR', 'ASC_TRAIN', 'CO', 'TT']


# Model building 
model = MixedLogit()
model.fit(X=df[varnames], 
          y=df['CHOICE'], 
          varnames=varnames,
          alts=df['alt'], 
          ids=df['custom_id'],
          avail=df['AV'],
          panels=df["ID"], randvars={'TT': 'n'}, 
          n_draws=100,
          optim_method='L-BFGS-B')


# Create predictions 
predictions = model.predict(X=df[varnames],
                            varnames=varnames,
                            alts=df['alt'],
                            ids=df['custom_id'],
                            avail=df['AV'],
                            panels=df["ID"],
                            n_draws=100,
                            return_proba=True)
# Recovering the predicted probabilities
pred_proba = predictions[1]

# Map each observation's chosen alternative to a column index of pred_proba
# (assumes the columns are ordered TRAIN, SM, CAR)
chosen = np.array(df_wide['CHOICE'].map({'TRAIN': 0, 'SM': 1, 'CAR': 2})).reshape(-1, 1)
# Select the probability of the chosen alternative
proba_chosen = np.take_along_axis(pred_proba, chosen, axis=1)
# Compute the log-likelihood (sum of the log probabilities of the chosen alternatives)
recreated_LL = np.sum(np.log(proba_chosen))
print("recreated LL:",recreated_LL)
print("model's LL  :",model.loglikelihood)

# recreated LL: -5293.024645448918
# model's LL  : -4360.226616589964

Second chunk

# The same again but with 1 draw only.
model.fit(X=df[varnames], 
          y=df['CHOICE'], 
          varnames=varnames,
          alts=df['alt'], 
          ids=df['custom_id'],
          avail=df['AV'],
          panels=df["ID"], randvars={'TT': 'n'}, 
          n_draws=1,
          optim_method='L-BFGS-B')


# Create predictions 
predictions = model.predict(X=df[varnames],
                            varnames=varnames,
                            alts=df['alt'],
                            ids=df['custom_id'],
                            avail=df['AV'],
                            panels=df["ID"],
                            n_draws=1,
                            return_proba=True)
# Recovering the predicted probabilities
pred_proba = predictions[1]

# Map each observation's chosen alternative to a column index of pred_proba
# (assumes the columns are ordered TRAIN, SM, CAR)
chosen = np.array(df_wide['CHOICE'].map({'TRAIN': 0, 'SM': 1, 'CAR': 2})).reshape(-1, 1)
# Select the probability of the chosen alternative
proba_chosen = np.take_along_axis(pred_proba, chosen, axis=1)
# Compute the log-likelihood (sum of the log probabilities of the chosen alternatives)
recreated_LL = np.sum(np.log(proba_chosen))
print("recreated LL:",recreated_LL)
print("model's LL  :",model.loglikelihood)


#recreated LL: -5331.206129776281
#model's LL  : -5331.206129776281
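One possible reading of the two results, stated as an assumption rather than a confirmed diagnosis of xlogit: if the fitted LL is computed per panel (the product of a respondent's chosen probabilities is averaged over draws) while the predicted probabilities are averaged over draws separately for each choice situation, the two sums can only be expected to agree when there is a single draw. The sketch below uses made-up chosen probabilities from a hypothetical one-coefficient binary logit. None of the names or functions come from xlogit.

```python
import numpy as np

rng = np.random.default_rng(0)
n_panels, n_obs, n_draws = 4, 3, 100  # hypothetical sizes

# One random coefficient per panel and draw, shared by all of that panel's
# choice situations (this sharing is what defines the panel structure here).
beta = rng.normal(size=(n_panels, 1, n_draws))
x = rng.uniform(0.5, 2.0, size=(n_panels, n_obs, 1))  # made-up attribute

# Hypothetical probability of the chosen alternative, shape (panel, obs, draw).
p = 1.0 / (1.0 + np.exp(-beta * x))

def panel_ll(p):
    # Product over a panel's choice situations, THEN average over draws.
    return np.sum(np.log(np.mean(np.prod(p, axis=1), axis=-1)))

def obs_ll(p):
    # Average over draws per choice situation, THEN sum the logs.
    return np.sum(np.log(np.mean(p, axis=-1)))

# Many draws: the two quantities differ.
ll_panel_R, ll_obs_R = panel_ll(p), obs_ll(p)
# Single draw: averaging is a no-op, so both reduce to the same sum of logs.
ll_panel_1, ll_obs_1 = panel_ll(p[:, :, :1]), obs_ll(p[:, :, :1])

print("100 draws:", ll_panel_R, ll_obs_R)
print("1 draw   :", ll_panel_1, ll_obs_1)
```

In this toy setup the panel-level value is the larger of the two with 100 draws, which is the same direction as the gap reported above. That is not evidence about the package itself, only a pattern that would be worth ruling out.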

Thank you in advance!

Álvaro
