[SPARK-53803][ML][Feature] Added ArimaRegression for time series forecasting in MLlib #52519
What changes were proposed in this pull request?
This pull request adds a new ArimaRegression estimator to Spark MLlib under org.apache.spark.ml.regression.
It introduces the ARIMA (AutoRegressive Integrated Moving Average) model for univariate time series forecasting, together with the corresponding model class, ArimaRegressionModel.
The change includes (a short usage sketch follows the list):
Scala code for ArimaRegression and ArimaRegressionModel
Support for the ARIMA orders p (autoregressive), d (differencing), and q (moving average)
PySpark API bindings for both classes
Unit tests in Scala and Python
Model save/load support using MLWritable and MLReadable
Example usage in examples/ml/ArimaRegressionExample.scala
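To tie the items above together, here is a minimal, hedged sketch of the intended usage, including the save/load support; the setters, class names, and the "value" column match this PR's description, while the save path and the ArimaRegressionModel.load reader follow the usual Spark ML MLWritable/MLReadable conventions and are assumptions about this implementation.
import org.apache.spark.ml.regression.{ArimaRegression, ArimaRegressionModel}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ArimaApiSketch").getOrCreate()
import spark.implicits._

// Univariate series as a single-column DataFrame, matching the examples below.
val series = Seq(100.0, 102.5, 101.0, 104.0, 107.5, 110.0).toDF("value")

// ARIMA(p, d, q): p = autoregressive order, d = degree of differencing,
// q = moving-average order.
val arima = new ArimaRegression().setP(1).setD(1).setQ(1)
val model = arima.fit(series)

// Save/load round trip via MLWritable/MLReadable (path is illustrative,
// and the companion-object load reader is assumed).
model.write.overwrite().save("/tmp/arima-model")
val restored = ArimaRegressionModel.load("/tmp/arima-model")
restored.transform(series).show(false)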
Why are the changes needed?
Spark MLlib currently has no built-in support for time series forecasting.
ARIMA is one of the most widely used models for forecasting trends in time series data.
Adding it lets Spark users run forecasts directly within MLlib, without relying on external libraries, and makes Spark's machine learning toolkit more complete.
Does this PR introduce any user-facing change?
Yes.
New APIs are available in both Scala and Python:
org.apache.spark.ml.regression.ArimaRegression
org.apache.spark.ml.regression.ArimaRegressionModel
pyspark.ml.regression.ArimaRegression
pyspark.ml.regression.ArimaRegressionModel
Both classes follow the standard Spark ML Estimator/Model conventions and work with Pipelines, ParamMaps, save/load, and transform(), as sketched below.
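As a hedged illustration of that integration: the Pipeline and ParamMap calls below are standard Spark ML APIs, while the param accessors arima.p and arima.q are assumed to follow the usual Spark ML naming and are not stated explicitly in this PR.
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.regression.ArimaRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ArimaPipelineSketch").getOrCreate()
import spark.implicits._

val series = Seq(100.0, 102.5, 101.0, 104.0, 107.5, 110.0).toDF("value")
val arima = new ArimaRegression().setP(1).setD(1).setQ(1)

// Used as an ordinary PipelineStage.
val pipeline = new Pipeline().setStages(Array(arima))
val pipelineModel = pipeline.fit(series)
pipelineModel.transform(series).show(false)

// Overriding params at fit time via a ParamMap (param accessor names assumed).
val overrides = ParamMap(arima.p -> 2, arima.q -> 0)
val altModel = arima.fit(series, overrides)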
How was this patch tested?
Tests were added in:
Scala (ArimaRegressionSuite.scala), covering:
Model fitting and transforming
Parameter defaults and setters
Save/load round trips
Python (test_regression.py), covering the PySpark bindings
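For reference, here is a rough sketch of the kind of checks the Scala suite covers; the getter names (getP/getD/getQ) and the bare AnyFunSuite setup are assumptions, and the actual suite presumably builds on Spark's shared ML test helpers instead.
import org.apache.spark.ml.regression.ArimaRegression
import org.apache.spark.sql.SparkSession
import org.scalatest.funsuite.AnyFunSuite

class ArimaRegressionSketchSuite extends AnyFunSuite {

  private lazy val spark = SparkSession.builder()
    .master("local[2]").appName("ArimaRegressionSketchSuite").getOrCreate()

  test("parameter setters and getters") {
    // Getter names assumed to follow standard Spark ML param conventions.
    val arima = new ArimaRegression().setP(2).setD(1).setQ(1)
    assert(arima.getP === 2)
    assert(arima.getD === 1)
    assert(arima.getQ === 1)
  }

  test("fit and transform produce one prediction per input row") {
    import spark.implicits._
    val df = Seq(100.0, 102.5, 101.0, 104.0, 107.5, 110.0).toDF("value")
    val model = new ArimaRegression().setP(1).setD(1).setQ(1).fit(df)
    assert(model.transform(df).count() === df.count())
  }
}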
Manual testing was also done in both spark-shell (Scala) and pyspark (Python), using the snippets below.
Scala:
import org.apache.spark.ml.regression.ArimaRegression
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().appName("ArimaRegressionExample").getOrCreate()
import spark.implicits._
// Univariate series as a single-column DataFrame.
val data = Seq(100.0, 102.5, 101.0, 104.0, 107.5, 110.0).toDF("value")
// ARIMA(1, 1, 1): first-order AR, one level of differencing, first-order MA.
val arima = new ArimaRegression()
  .setP(1)
  .setD(1)
  .setQ(1)
val model = arima.fit(data)
val forecast = model.transform(data)
forecast.show(false)
Python:
from pyspark.ml.regression import ArimaRegression
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("ArimaRegressionExample").getOrCreate()
# Univariate series as a single-column DataFrame.
data = [(100.0,), (102.5,), (101.0,), (104.0,), (107.5,), (110.0,)]
df = spark.createDataFrame(data, ["value"])
# ARIMA(1, 1, 1): first-order AR, one level of differencing, first-order MA.
arima = ArimaRegression(p=1, d=1, q=1)
model = arima.fit(df)
forecast = model.transform(df)
forecast.show(truncate=False)
The predictions and the output schema were checked manually for correctness.
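A hedged sketch of what that check can look like, continuing the Scala snippet above; the "prediction" column name assumes the standard Spark ML predictionCol default, which this PR does not state explicitly.
// Continuing the Scala example above: the output should keep the input column
// and add a prediction column (name assumed from the predictionCol default).
forecast.printSchema()
assert(forecast.columns.contains("value"))
assert(forecast.columns.contains("prediction"))
assert(forecast.count() == data.count())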
Was this patch authored or co-authored using generative AI tooling?
No.