Skip to content

wide_to_long with NaNs in i argument #31976

@MalteMax

Description

@MalteMax

Code and output:

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'id' : ['a', np.NaN, 'c'],
                   'A_1999' : [1, 2, 3],
                   'A_2000' : [5, 6, 7]})

df1
    id  A_1999  A_2000
0    a       1       5
1  NaN       2       6
2    c       3       7

df2 = pd.wide_to_long(df1,
                      stubnames = ['A'],
                      sep = '_',
                      i = 'id',
                      j = 'year').reset_index()

df2
  id  year  A
0  a  1999  1
1  a  1999  2
2  c  1999  3
3  a  2000  5
4  a  2000  6
5  c  2000  7

Problem description

df1 has one NaN in the id column. When applying wide_to_long to df1, wide_to_long seems to fill the NaNs from df1 with the preceding value from the df1 dataframe (namely, a).

Why might this be a problem: instances of id == 'a' in df1 are not the same as in df2 (because the NaNs from df1 have been replaced by a in df2):

df1.loc[df1['id'] == 'a',]

  id  A_1999  A_2000
0  a       1       5

versus:

df2.loc[df2['id'] == 'a',]

  id  year  A
0  a  1999  1
1  a  1999  2
3  a  2000  5
4  a  2000  6

Expected Output

I would expect wide_to_long to either

  1. throw an error that the column that is passed to its i argument contains NaNs, or
  2. preserve the NaN case in the second dataframe (not sure whether this is desirable), i.e.:
    id  year  A
0    a  1999  1
1  NaN  1999  2
2    c  1999  3
3    a  2000  5
4  NaN  2000  6
5    c  2000  7

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None

pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 45.2.0.post20200210
Cython : 0.29.15
pytest : 5.3.5
hypothesis : 5.4.1
sphinx : 1.8.5
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : 2.8.3 (dt dec pq3 ext lo64)
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : 3.6.1
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.43.1

Metadata

Metadata

Assignees

No one assigned

    Labels

    BugMissing-datanp.nan, pd.NaT, pd.NA, dropna, isnull, interpolateReshapingConcat, Merge/Join, Stack/Unstack, Explode

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions