When testing how well the LogisticGlm model scales with a large toy data set, I am finding on my local machine (16 GB RAM) that I hit out-of-memory errors even for fairly modest problem sizes.
Here is some example code that generates a toy logistic regression data set and fits it:
import breeze.linalg.{DenseMatrix, DenseVector}
import scalaglm.{Glm, LogisticGlm}

object glm extends App {

  // Helper function to map synthetically generated data into
  // training labels of a logistic regression.
  def logistic_fn(x: Double): Double = {
    1.0 / (1.0 + math.exp(-x))
  }

  def fit_logistic(): Glm = {
    // Parameters for creating synthetic data
    val r = new scala.util.Random(0)
    val normal = breeze.stats.distributions.Gaussian(0, 1)

    // Define problem size num_observations x num_features
    val num_observations = 1000000
    val num_features = 50

    val beta = DenseVector.rand(num_features) :* 5.0
    val names = for (i <- 1 to num_features) yield "var_%d".format(i)
    println("True coefficients:")
    println(beta(0 to 10))

    // Create synthetic logistic regression data set.
    val x = DenseMatrix.rand(num_observations, num_features, normal)
    x(::, 0) := 1.0
    val true_logits = x * beta
    val y = true_logits map logistic_fn map { p_i => if (r.nextDouble < p_i) 1.0 else 0.0 }

    val t1 = System.nanoTime
    val g = Glm(y, x, names, LogisticGlm, addIntercept = false, its = 1000)
    println("Elapsed %4.2f for training model".format((System.nanoTime - t1) / 1e9d))
    g
  }
}
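For scale, here is my own back-of-the-envelope arithmetic on the raw matrix sizes involved (assuming plain dense doubles, 8 bytes each):

// Rough sizes of the objects involved, in bytes.
val n = 1000000L                      // observations
val p = 50L                           // features
val xBytes = n * p * 8L               // design matrix X: 400,000,000 bytes, i.e. ~0.4 GB
val xCopyBytes = xBytes               // every full temporary copy of X costs the same again
val denseWBytes = n * n * 8L          // a materialized n x n weight matrix would be ~8 TB
println(f"X (or one copy of it): ${xBytes / 1e9}%.1f GB; dense n x n W: ${denseWBytes / 1e12}%.1f TB")

So the design matrix itself is well under half a gigabyte; only something that scales like n x n, or many simultaneous full copies of X, should get anywhere near 16 GB.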
With this problem size (1 million observations by 50 features), I immediately get an OOM error:
scala> val g = glm.fit_logistic()
True coefficients:
DenseVector(2.78135510778451, 3.6818164882958326, 3.4840289537745948, 4.912012391491977, 2.907467492064324, 0.7532367248769811, 4.496847165217405, 0.20064910613956877, 4.855909891445109, 0.6049146229107971, 4.8162668734131895)
Aug 02, 2018 11:03:48 AM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/0p/vx1f5tn93z1dc8pzk21g5nx40000gn/T/jniloader2218725777137246063netlib-native_system-osx-x86_64.jnilib
java.lang.OutOfMemoryError: Java heap space
at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:153)
at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:151)
at breeze.linalg.DenseMatrix$.zeros(DenseMatrix.scala:345)
at breeze.linalg.DenseMatrix$$anon$33.$anonfun$apply$2(DenseMatrix.scala:823)
at breeze.linalg.DenseMatrix$$anon$33.$anonfun$apply$2$adapted(DenseMatrix.scala:820)
at breeze.linalg.DenseMatrix$$anon$33$$Lambda$5324/324878705.apply(Unknown Source)
at scala.collection.immutable.Range.foreach(Range.scala:156)
at breeze.linalg.DenseMatrix$$anon$33.apply(DenseMatrix.scala:820)
at breeze.linalg.DenseMatrix$$anon$33.apply(DenseMatrix.scala:817)
at breeze.linalg.BroadcastedColumns$$anon$4.apply(BroadcastedColumns.scala:91)
at breeze.linalg.BroadcastedColumns$$anon$4.apply(BroadcastedColumns.scala:89)
at breeze.linalg.ImmutableNumericOps.$times(NumericOps.scala:149)
at breeze.linalg.ImmutableNumericOps.$times$(NumericOps.scala:148)
at breeze.linalg.BroadcastedColumns.$times(BroadcastedColumns.scala:30)
at scalaglm.Irls$.IRLS(Glm.scala:243)
at scalaglm.Glm.<init>(Glm.scala:87)
at glm$.fit_logistic(glm.scala:30)
... 15 elided
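For completeness, these are the standard JVM knobs I can use to check and raise the heap while digging into this (nothing scala-glm-specific, just noting them here):

// From the REPL: report the JVM's current max heap, in GB.
println(Runtime.getRuntime.maxMemory / math.pow(1024, 3))

// Launching with a larger heap (the scala launcher forwards -J options to the JVM):
//   scala -J-Xmx8g
// Under sbt, fork the run and set:
//   fork := true
//   javaOptions += "-Xmx8g"

That would at least tell me whether the usage is a few GB or effectively unbounded.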
This is a fairly small problem instance: if I generate the same data set with numpy and serialize it to a binary file on disk, it is less than 5 GB, and statsmodels or scikit-learn in Python have no trouble loading that data and fitting the model (even with the standard error calculations).
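To make the comparison concrete, my understanding is that the weighted cross-product an IRLS step needs never requires an n x n object. Here is a minimal sketch of that computation (my own illustration, assuming diagonal working weights w, and definitely not scala-glm's actual code):

import breeze.linalg.{DenseMatrix, DenseVector}

// Illustration only: the weighted cross-product X'WX needed by one IRLS step.
// W is conceptually an n x n diagonal matrix of working weights, but it never has
// to be materialized: row-scaling a single copy of X is enough, so peak usage is
// roughly two n x p matrices plus a p x p result.
def weightedCrossProduct(x: DenseMatrix[Double], w: DenseVector[Double]): DenseMatrix[Double] = {
  val xw = x.copy                     // one extra n x p temporary (~0.4 GB here)
  var j = 0
  while (j < xw.cols) {
    xw(::, j) :*= w                   // scale column j elementwise by w, i.e. row-scale X
    j += 1
  }
  x.t * xw                            // p x p (50 x 50 for this problem)
}

Computed this way, the peak is X plus one row-scaled copy (about 0.8 GB for this problem) plus a tiny 50 x 50 result, which is why the immediate OOM surprises me.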
What are the root causes of such unexpectedly high memory usage in scala-glm?
A secondary question is how to monitor convergence on a data set this large. I can increase the number of iterations, but there is no per-iteration feedback during model fitting to indicate whether the fit is converging.
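In the meantime, the only way I can see to get per-iteration feedback is to hand-roll the IRLS loop outside the library. Here is a minimal sketch (my own code, not scala-glm's implementation; it assumes X already carries the intercept column, uses the same :* / :/ elementwise operators as the example above, and does not guard exp(eta) against overflow) that prints the deviance at each iteration:

import breeze.linalg.{DenseMatrix, DenseVector, sum}
import breeze.numerics.{exp, log1p, sigmoid}

// Hand-rolled IRLS for a logistic GLM, purely to get per-iteration feedback.
def irlsWithTrace(x: DenseMatrix[Double], y: DenseVector[Double],
                  maxIts: Int = 50, tol: Double = 1e-8): DenseVector[Double] = {
  var beta = DenseVector.zeros[Double](x.cols)
  var lastDev = Double.PositiveInfinity
  var it = 0
  var converged = false
  while (it < maxIts && !converged) {
    val eta = x * beta
    val mu = sigmoid(eta)
    val w = mu.map(m => m * (1.0 - m))            // working weights
    val z = eta + ((y - mu) :/ w)                 // working response
    // Row-scale a copy of X by w rather than forming an n x n weight matrix.
    val xw = x.copy
    var j = 0
    while (j < xw.cols) { xw(::, j) :*= w; j += 1 }
    beta = (x.t * xw) \ (xw.t * z)                // solve (X'WX) beta = X'Wz
    // Bernoulli deviance at the updated coefficients: -2 * log-likelihood.
    val etaNew = x * beta
    val dev = -2.0 * sum((y :* etaNew) - log1p(exp(etaNew)))
    println(f"iteration $it%3d   deviance = $dev%.4f")
    converged = math.abs(lastDev - dev) < tol * (math.abs(dev) + tol)
    lastDev = dev
    it += 1
  }
  beta
}

Something along these lines built into Glm (a verbose flag or a per-iteration callback) would make it much easier to tell whether increasing its is actually helping.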