@anandexplore anandexplore commented Oct 5, 2025

What changes were proposed in this pull request?

This pull request adds a new estimator, ArimaRegression, to Spark MLlib under org.apache.spark.ml.regression.
It implements the ARIMA (AutoRegressive Integrated Moving Average) model for univariate time series forecasting, along with the corresponding model class ArimaRegressionModel.
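
For reference, the textbook ARIMA(p, d, q) model can be written with the backshift operator B, where B y_t = y_{t-1}. This is the standard form from the time series literature. The pull request does not state which parameterization or estimation method ArimaRegression uses, so the symbols below (phi_i, theta_j, c, epsilon_t) are assumptions for illustration only.

```latex
\left(1 - \sum_{i=1}^{p} \phi_i B^i\right) (1 - B)^d \, y_t
  = c + \left(1 + \sum_{j=1}^{q} \theta_j B^j\right) \varepsilon_t
```

With p = 1, d = 1, q = 1, the setting used in the manual tests, this reduces to (1 - phi_1 B)(1 - B) y_t = c + (1 + theta_1 B) epsilon_t.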

The update includes:
Scala code for ArimaRegression and ArimaRegressionModel
Support for ARIMA parameters: p, d, and q
PySpark API bindings for both classes
Unit tests in Scala and Python
Model save/load support using MLWritable and MLReadable
Example usage in examples/ml/ArimaRegressionExample.scala
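
The d parameter listed above controls how many times the series is differenced before the autoregressive and moving-average terms are fitted. Below is a minimal plain-Python sketch of that step, using the sample values from the manual test in this description. The helper names `difference` and `undifference` are hypothetical and are not part of the proposed API; this is not the PR's implementation.

```python
from typing import List


def difference(series: List[float], d: int = 1) -> List[float]:
    """Apply first-order differencing d times (the "I" step of ARIMA)."""
    out = list(series)
    for _ in range(d):
        # Each pass replaces the series with consecutive differences,
        # shortening it by one element.
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def undifference(diffs: List[float], first_value: float) -> List[float]:
    """Invert one round of differencing, given the series' first value."""
    out = [first_value]
    for step in diffs:
        out.append(out[-1] + step)
    return out


# Sample values taken from the manual test in this description.
values = [100.0, 102.5, 101.0, 104.0, 107.5, 110.0]

diffs = difference(values, d=1)
print(diffs)  # [2.5, -1.5, 3.0, 3.5, 2.5]

# Integrating the differences recovers the original series.
restored = undifference(diffs, values[0])
print(restored == values)  # True
```

Forecasts produced on the differenced scale would need the same integration step to be mapped back to the original scale.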

Why are the changes needed?

Currently, Spark MLlib does not have built-in tools for time-series forecasting.
ARIMA is one of the most common models for predicting trends in time-based data.
Adding this feature lets Spark users run forecasts directly within MLlib, without relying on external Python libraries. It also makes Spark's machine learning toolkit more complete.

Does this PR introduce any user-facing change?

Yes.
New APIs are available in both Scala and Python:
org.apache.spark.ml.regression.ArimaRegression
org.apache.spark.ml.regression.ArimaRegressionModel
pyspark.ml.regression.ArimaRegression
pyspark.ml.regression.ArimaRegressionModel

These follow standard Spark ML APIs and work with Pipelines, ParamMaps, save/load, and transform().

How was this patch tested?

Tests were added in:

Scala (ArimaRegressionSuite.scala) for:
  Model fitting and transforming
  Parameter defaults and setters
  Save/load functions
Python (test_regression.py) for the PySpark interface

Manual testing was also done in both:
  spark-shell (Scala)
  pyspark (Python)

Manual testing
Scala:
import org.apache.spark.ml.regression.ArimaRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ArimaRegressionExample").getOrCreate()
import spark.implicits._
val data = Seq(100.0, 102.5, 101.0, 104.0, 107.5, 110.0).toDF("value")
val arima = new ArimaRegression()
  .setP(1)
  .setD(1)
  .setQ(1)
val model = arima.fit(data)
val forecast = model.transform(data)
forecast.show(false)

Python:
from pyspark.ml.regression import ArimaRegression
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ArimaRegressionExample").getOrCreate()
data = [(100.0,), (102.5,), (101.0,), (104.0,), (107.5,), (110.0,)]
df = spark.createDataFrame(data, ["value"])
arima = ArimaRegression(p=1, d=1, q=1)
model = arima.fit(df)
forecast = model.transform(df)
forecast.show(truncate=False)

Predictions and output schema were checked for correctness.

Was this patch authored or co-authored using generative AI tooling?

No.

@anandexplore anandexplore changed the title from "[SPARK-53803][ML][Feature] Add ArimaRegression for time series forecasting in MLlib" to "[SPARK-53803][ML][Feature] Added ArimaRegression for time series forecasting in MLlib" on Oct 5, 2025
anandexplore commented Oct 5, 2025

@sryza Could you review this commit?
CC @HyukjinKwon @hvanhovell
