[SPARK] Implement Correct fit() and transform() in SparkKMeansOperator by Manas-Dikshit · Pull Request #508 · apache/wayang

Manas-Dikshit · 2025-02-27T17:44:38Z

This PR fixes the implementation of fit() and transform() in SparkKMeansOperator.java so that the operator correctly wraps Apache Spark MLlib's KMeans clustering.

Changes
✅ Properly converts JavaRDD<double[]> to Dataset<Row> using convertToDataFrame().
✅ Uses the correct features and prediction column names.
✅ Ensures transform() outputs a Tuple2<double[], Integer>, so each input point is returned together with its assigned cluster index.
✅ Implements predict() method to return only the cluster labels.
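
To make the difference between the transform() and predict() output shapes concrete, here is a minimal plain-Java sketch. It is not the code from this PR and it does not use Spark: the class name `KMeansAssignSketch` and its helper methods are hypothetical, `Map.Entry<double[], Integer>` stands in for `Tuple2<double[], Integer>`, and the cluster centers are assumed to have been produced by an earlier fit() step.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the SparkKMeansOperator implementation.
public class KMeansAssignSketch {

    // Returns the index of the center closest to the point (squared Euclidean distance).
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < point.length; d++) {
                double diff = point[d] - centers[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // transform-style output: each input point paired with its cluster index.
    // Map.Entry is an assumed stand-in for Tuple2<double[], Integer>.
    static List<Map.Entry<double[], Integer>> transform(List<double[]> points, double[][] centers) {
        List<Map.Entry<double[], Integer>> out = new ArrayList<>();
        for (double[] p : points) {
            out.add(new SimpleImmutableEntry<>(p, nearestCenter(p, centers)));
        }
        return out;
    }

    // predict-style output: only the cluster labels, in input order.
    static List<Integer> predict(List<double[]> points, double[][] centers) {
        List<Integer> labels = new ArrayList<>();
        for (Map.Entry<double[], Integer> e : transform(points, centers)) {
            labels.add(e.getValue());
        }
        return labels;
    }

    public static void main(String[] args) {
        // Centers are assumed to come from a previous fit step.
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        List<double[]> points = List.of(new double[]{1.0, 1.0}, new double[]{9.0, 8.0});
        System.out.println(predict(points, centers)); // prints [0, 1]
    }
}
```

Keeping the original point next to its label in the transform-style result lets downstream operators use both without a separate join, while the predict-style result suits callers that only need the labels.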

Issue Fixed
🔧 Fixes Issue #364: "Support Fit and Transform in SparkKMeansOperator"

Testing
✔ Verified clustering correctness using sample RDD<double[]> input.
✔ Checked that cluster centers and predictions are accurately extracted.

zkaoudi

Thank you @Manas-Dikshit

Update SparkKMeansOperator.java

5509aac

novatechflow requested a review from zkaoudi February 27, 2025 17:52

zkaoudi approved these changes Feb 27, 2025

zkaoudi merged commit 1d5736f into apache:main Feb 27, 2025
4 checks passed

zkaoudi merged 1 commit into apache:main from Manas-Dikshit:main

2 participants
