@Manas-Dikshit
Contributor

This PR fixes the implementation of fit() and transform() in SparkKMeansOperator.java so that the operator uses Apache Spark MLlib's KMeans clustering correctly.

Changes
✅ Converts JavaRDD<double[]> to a Dataset using convertToDataFrame().
✅ Uses the correct features and prediction column names.
✅ Ensures transform() outputs Tuple2<double[], Integer> values for better usability.
✅ Implements a predict() method that returns only the cluster labels.
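
To illustrate the difference between the two output shapes described above, here is a minimal plain-Java sketch. It is not the code from SparkKMeansOperator.java and does not use Spark: the class name KMeansOutputSketch, the helper nearestCenter(), and the use of Map.Entry in place of Tuple2 are all assumptions made for illustration, and the cluster centers are taken as given rather than fitted.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the operator's output contract; not the PR's code.
public class KMeansOutputSketch {

    // Index of the center closest to the point (squared Euclidean distance).
    static int nearestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = point[i] - centers[c][i];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // transform(): keeps each input point and pairs it with a cluster label
    // (Map.Entry is used here as a stand-in for Tuple2<double[], Integer>).
    static List<Map.Entry<double[], Integer>> transform(List<double[]> points, double[][] centers) {
        List<Map.Entry<double[], Integer>> out = new ArrayList<>();
        for (double[] p : points) {
            out.add(new SimpleEntry<>(p, nearestCenter(p, centers)));
        }
        return out;
    }

    // predict(): returns only the cluster labels, in input order.
    static List<Integer> predict(List<double[]> points, double[][] centers) {
        List<Integer> labels = new ArrayList<>();
        for (double[] p : points) {
            labels.add(nearestCenter(p, centers));
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] centers = {{0.0, 0.0}, {10.0, 10.0}};
        List<double[]> points = List.of(new double[]{1.0, 1.0}, new double[]{9.0, 9.0});
        System.out.println(predict(points, centers)); // prints [0, 1]
    }
}
```

The design point is that transform() keeps each input vector next to its label, so callers do not have to re-join predictions with the data, while predict() is the lighter call when only the labels are needed.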

Issue Fixed
🔧 Fixes Issue #364: "Support Fit and Transform in SparkKMeansOperator"

Testing
✔ Verified clustering correctness using a sample RDD<double[]> input.
✔ Checked that cluster centers and predictions are accurately extracted.

@novatechflow requested a review from @zkaoudi on February 27, 2025, 17:52
@zkaoudi left a comment
Contributor

Thank you @Manas-Dikshit

@zkaoudi merged commit 1d5736f into apache:main on Feb 27, 2025
4 checks passed