When testing how well the LogisticGlm model scales with a large toy data set, I am finding on my local machine (16 GB RAM) that I hit out-of-memory errors even for fairly modest problem sizes.
Here is some example code that generates a toy logistic regression data set and fits it:
import breeze.linalg.{DenseMatrix, DenseVector}
import scalaglm.{Glm, LogisticGlm}

object glm extends App {

  // Helper function to map synthetically generated data into
  // training labels of a logistic regression.
  def logistic_fn(x: Double): Double = {
    1.0 / (1.0 + math.exp(-x))
  }

  def fit_logistic(): Glm = {
    // Parameters for creating synthetic data
    val r = new scala.util.Random(0)
    val normal = breeze.stats.distributions.Gaussian(0, 1)

    // Define problem size num_observations x num_features
    val num_observations = 1000000
    val num_features = 50

    val beta = DenseVector.rand(num_features) :* 5.0
    val names = for (i <- 1 to num_features) yield "var_%d".format(i)
    println("True coefficients:")
    println(beta(0 to 10))

    // Create synthetic logistic regression data set.
    val x = DenseMatrix.rand(num_observations, num_features, normal)
    x(::, 0) := 1.0
    val true_logits = x * beta
    val y = true_logits map logistic_fn map { p_i => if (r.nextDouble < p_i) 1.0 else 0.0 }

    val t1 = System.nanoTime
    val g = Glm(y, x, names, LogisticGlm, addIntercept = false, its = 1000)
    println("Elapsed %4.2f for training model".format((System.nanoTime - t1) / 1e9d))
    g
  }
}
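For scale, here is my own back-of-the-envelope arithmetic on the raw matrix sizes involved (assuming plain dense doubles, 8 bytes each):

// Rough sizes of the objects involved, in bytes.
val n = 1000000L                      // observations
val p = 50L                           // features
val xBytes = n * p * 8L               // design matrix X: 400,000,000 bytes, i.e. ~0.4 GB
val xCopyBytes = xBytes               // every full temporary copy of X costs the same again
val denseWBytes = n * n * 8L          // a materialized n x n weight matrix would be ~8 TB
println(f"X (or one copy of it): ${xBytes / 1e9}%.1f GB; dense n x n W: ${denseWBytes / 1e12}%.1f TB")

So the design matrix itself is well under half a gigabyte; only something that scales like n x n, or many simultaneous full copies of X, should get anywhere near 16 GB.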
With this problem size (1 million observations by 50 features), I immediately get an OOM error:
scala> val g = glm.fit_logistic()
True coefficients:
DenseVector(2.78135510778451, 3.6818164882958326, 3.4840289537745948, 4.912012391491977, 2.907467492064324, 0.7532367248769811, 4.496847165217405, 0.20064910613956877, 4.855909891445109, 0.6049146229107971, 4.8162668734131895)
Aug 02, 2018 11:03:48 AM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/0p/vx1f5tn93z1dc8pzk21g5nx40000gn/T/jniloader2218725777137246063netlib-native_system-osx-x86_64.jnilib
java.lang.OutOfMemoryError: Java heap space
at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:153)
at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:151)
at breeze.linalg.DenseMatrix$.zeros(DenseMatrix.scala:345)
at breeze.linalg.DenseMatrix$$anon$33.$anonfun$apply$2(DenseMatrix.scala:823)
at breeze.linalg.DenseMatrix$$anon$33.$anonfun$apply$2$adapted(DenseMatrix.scala:820)
at breeze.linalg.DenseMatrix$$anon$33$$Lambda$5324/324878705.apply(Unknown Source)
at scala.collection.immutable.Range.foreach(Range.scala:156)
at breeze.linalg.DenseMatrix$$anon$33.apply(DenseMatrix.scala:820)
at breeze.linalg.DenseMatrix$$anon$33.apply(DenseMatrix.scala:817)
at breeze.linalg.BroadcastedColumns$$anon$4.apply(BroadcastedColumns.scala:91)
at breeze.linalg.BroadcastedColumns$$anon$4.apply(BroadcastedColumns.scala:89)
at breeze.linalg.ImmutableNumericOps.$times(NumericOps.scala:149)
at breeze.linalg.ImmutableNumericOps.$times$(NumericOps.scala:148)
at breeze.linalg.BroadcastedColumns.$times(BroadcastedColumns.scala:30)
at scalaglm.Irls$.IRLS(Glm.scala:243)
at scalaglm.Glm.<init>(Glm.scala:87)
at glm$.fit_logistic(glm.scala:30)
... 15 elided
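For completeness, these are the standard JVM knobs I can use to check and raise the heap while digging into this (nothing scala-glm-specific, just noting them here):

// From the REPL: report the JVM's current max heap, in GB.
println(Runtime.getRuntime.maxMemory / math.pow(1024, 3))

// Launching with a larger heap (the scala launcher forwards -J options to the JVM):
//   scala -J-Xmx8g
// Under sbt, fork the run and set:
//   fork := true
//   javaOptions += "-Xmx8g"

That would at least tell me whether the usage is a few GB or effectively unbounded.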
This is a fairly small problem instance: if I generate the same data set with numpy and serialize it to a binary file on disk, it is less than 5 GB, and statsmodels or scikit-learn in Python have no trouble loading that data and fitting the model (even with the standard error calculations).
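To make the comparison concrete, my understanding is that the weighted cross-product an IRLS step needs never requires an n x n object. Here is a minimal sketch of that computation (my own illustration, assuming diagonal working weights w, and definitely not scala-glm's actual code):

import breeze.linalg.{DenseMatrix, DenseVector}

// Illustration only: the weighted cross-product X'WX needed by one IRLS step.
// W is conceptually an n x n diagonal matrix of working weights, but it never has
// to be materialized: row-scaling a single copy of X is enough, so peak usage is
// roughly two n x p matrices plus a p x p result.
def weightedCrossProduct(x: DenseMatrix[Double], w: DenseVector[Double]): DenseMatrix[Double] = {
  val xw = x.copy                     // one extra n x p temporary (~0.4 GB here)
  var j = 0
  while (j < xw.cols) {
    xw(::, j) :*= w                   // scale column j elementwise by w, i.e. row-scale X
    j += 1
  }
  x.t * xw                            // p x p (50 x 50 for this problem)
}

Computed this way, the peak is X plus one row-scaled copy (about 0.8 GB for this problem) plus a tiny 50 x 50 result, which is why the immediate OOM surprises me.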
What are the root causes of such unexpectedly high memory usage in scala-glm?
A secondary question is how to monitor convergence on a data set this large. I can increase the number of iterations, but there is no per-iteration feedback during model fitting to indicate whether the fit is converging.
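In the meantime, the only way I can see to get per-iteration feedback is to hand-roll the IRLS loop outside the library. Here is a minimal sketch (my own code, not scala-glm's implementation; it assumes X already carries the intercept column, uses the same :* / :/ elementwise operators as the example above, and does not guard exp(eta) against overflow) that prints the deviance at each iteration:

import breeze.linalg.{DenseMatrix, DenseVector, sum}
import breeze.numerics.{exp, log1p, sigmoid}

// Hand-rolled IRLS for a logistic GLM, purely to get per-iteration feedback.
def irlsWithTrace(x: DenseMatrix[Double], y: DenseVector[Double],
                  maxIts: Int = 50, tol: Double = 1e-8): DenseVector[Double] = {
  var beta = DenseVector.zeros[Double](x.cols)
  var lastDev = Double.PositiveInfinity
  var it = 0
  var converged = false
  while (it < maxIts && !converged) {
    val eta = x * beta
    val mu = sigmoid(eta)
    val w = mu.map(m => m * (1.0 - m))            // working weights
    val z = eta + ((y - mu) :/ w)                 // working response
    // Row-scale a copy of X by w rather than forming an n x n weight matrix.
    val xw = x.copy
    var j = 0
    while (j < xw.cols) { xw(::, j) :*= w; j += 1 }
    beta = (x.t * xw) \ (xw.t * z)                // solve (X'WX) beta = X'Wz
    // Bernoulli deviance at the updated coefficients: -2 * log-likelihood.
    val etaNew = x * beta
    val dev = -2.0 * sum((y :* etaNew) - log1p(exp(etaNew)))
    println(f"iteration $it%3d   deviance = $dev%.4f")
    converged = math.abs(lastDev - dev) < tol * (math.abs(dev) + tol)
    lastDev = dev
    it += 1
  }
  beta
}

Something along these lines built into Glm (a verbose flag or a per-iteration callback) would make it much easier to tell whether increasing its is actually helping.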