Koalas.idxmin() is not picking the minimum value from a dataframe, but pandas.idxmin() gives #2225

@nikeshv

Hi,
I have a Koalas DataFrame with Age and Income columns. I computed the z-score of each column, then computed the norm of each (Income_zscore, Age_zscore) pair and stored it in a new column named sq_dist. When I call idxmin() on the new column and look up the value at the returned index, it is not the minimum value.
I did the same operations on a pandas DataFrame, and there the lookup does give the minimum value.

Please find attached the notebook with the step-by-step operations I performed.

cmd1
import databricks.koalas as ks
import pandas as pd
import random

cmd2
#Create Sample dataframe in Koalas
df = ks.DataFrame.from_dict({
'Age': [random.randint(0, 100000) for i in range(100000)],
'Income': [random.randint(0, 100000) for i in range(100000)]
})

print(df.head(5))

cmd3
import scipy.stats as stats
import numpy as np
ks.set_option('compute.ops_on_diff_frames', True)
df['Income_zscore'] = ks.Series(stats.zscore(df['Income'].to_numpy()))
df['Age_zscore'] = ks.Series(stats.zscore(df['Age'].to_numpy()))
df['sq_dist'] = [np.linalg.norm(i) for i in df[['Income_zscore','Age_zscore']].to_numpy()]
ks.set_option('compute.ops_on_diff_frames', False)

cmd4
#display(df)

cmd5
#calculate index of the min of sq_dist
minindex=df['sq_dist'].idxmin()
minindex

cmd6
#display min value of sq_dist
df['sq_dist'].iloc[minindex]

cmd7
df.to_spark().createOrReplaceTempView("koalastable")

cmd8
%sql
select min(sq_dist) from koalastable -- This doesn't match the value we got in cmd6

cmd9
#do same operations with Pandas
df_spark = df.to_spark()
stats_array = np.array(df_spark.select('Age', 'Income').collect())
normalized_data = stats.zscore(stats_array, axis=0)
df_pd = pd.DataFrame(data=normalized_data, columns=['Age', 'Income'])
df_pd['sq_dist'] = [np.linalg.norm(i) for i in normalized_data]
df_pd.head(5)

cmd10
minindex_pd=df_pd['sq_dist'].idxmin()
minindex_pd

cmd11
#minimum of sq_dist using Pandas
df_pd['sq_dist'].iloc[minindex_pd]

cmd12
spark.createDataFrame(df_pd).createOrReplaceTempView("pandastable")

cmd13
%sql
select min(sq_dist) from pandastable -- This matches the value we got in cmd11
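
One thing that may be worth ruling out when comparing cmd6 and cmd11: idxmin() returns an index label, while .iloc looks up by integer position. The two agree only when the index is the default 0..n-1 range in the same order as the rows. Whether that is what happens in the Koalas case above is an assumption and not confirmed here. The sketch below uses plain pandas and hypothetical toy data (the values and index are made up for illustration) to show the difference between the label-based and position-based lookups.

```python
import pandas as pd

# Hypothetical toy data; the non-default index makes labels differ from positions.
s = pd.Series([3.0, 1.0, 2.0], index=[10, 20, 30], name="sq_dist")

label = s.idxmin()                 # index LABEL of the minimum
by_label = s.loc[label]            # label-based lookup pairs with idxmin()

position = s.to_numpy().argmin()   # integer POSITION of the minimum
by_position = s.iloc[position]     # position-based lookup pairs with argmin()

print(label, by_label)
print(position, by_position)
```

If the label and the position differ, looking up the idxmin() result with .loc instead of .iloc is the consistent pairing.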
