Commit 4c36ca3

itholic authored and HyukjinKwon committed
[SPARK-46065][PS] Refactor (DataFrame|Series).factorize() to use create_map
### What changes were proposed in this pull request?

This PR proposes to refactor `(DataFrame|Series).factorize()` to use `create_map`.

### Why are the changes needed?

To optimize performance by using the official API and also to improve readability.

### Does this PR introduce _any_ user-facing change?

No, it's an internal refactoring.

### How was this patch tested?

The existing CI should pass.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #43970 from itholic/refactor_factorize.

Authored-by: Haejoon Lee <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
1 parent a703dac commit 4c36ca3

1 file changed, with 2 additions and 9 deletions.

python/pyspark/pandas/base.py

Lines changed: 2 additions & 9 deletions
@@ -1672,16 +1672,9 @@ def factorize(
         if len(kvs) == 0:  # uniques are all missing values
             new_scol = F.lit(na_sentinel_code)
         else:
+            map_scol = F.create_map(*kvs)
             null_scol = F.when(self.isnull().spark.column, F.lit(na_sentinel_code))
-            mapped_scol = None
-            for i in range(0, len(kvs), 2):
-                key = kvs[i]
-                value = kvs[i + 1]
-                if mapped_scol is None:
-                    mapped_scol = F.when(self.spark.column == key, value)
-                else:
-                    mapped_scol = mapped_scol.when(self.spark.column == key, value)
-            new_scol = null_scol.otherwise(mapped_scol)
+            new_scol = null_scol.otherwise(map_scol[self.spark.column])
 
         codes = self._with_new_scol(new_scol.alias(self._internal.data_spark_column_names[0]))
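
The diff swaps a chain of `when` comparisons for a single map lookup. The sketch below is a plain-Python model of that equivalence, not PySpark code and not the actual implementation: the function names, the `NA_SENTINEL_CODE` constant, and the sample data are hypothetical. It assumes `kvs` is a flat list of alternating keys and codes with unique keys, as the removed loop implies.

```python
from typing import Any, List, Optional, Sequence

NA_SENTINEL_CODE = -1  # hypothetical stand-in for na_sentinel_code


def codes_via_chained_when(values: Sequence[Any], kvs: Sequence[Any]) -> List[Optional[int]]:
    """Model of the removed loop: compare against each key in turn."""
    codes: List[Optional[int]] = []
    for v in values:
        if v is None:  # mirrors the null_scol branch
            codes.append(NA_SENTINEL_CODE)
            continue
        code = None  # a chain of comparisons with no match yields None
        for i in range(0, len(kvs), 2):
            if v == kvs[i]:
                code = kvs[i + 1]
                break
        codes.append(code)
    return codes


def codes_via_map_lookup(values: Sequence[Any], kvs: Sequence[Any]) -> List[Optional[int]]:
    """Model of the new code: build one map, then look each value up."""
    lookup = dict(zip(kvs[0::2], kvs[1::2]))  # analogous to create_map(*kvs)
    return [NA_SENTINEL_CODE if v is None else lookup.get(v) for v in values]


values = ["b", None, "a", "b"]
kvs = ["b", 0, "a", 1]  # flat key/code pairs

chained = codes_via_chained_when(values, kvs)
mapped = codes_via_map_lookup(values, kvs)
print(chained, mapped)
```

In this model both functions produce the same codes, and the map version replaces one comparison per unique value with a single lookup expression, which is the readability gain the commit message describes.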
