Unexpectedly high memory usage for large models. #1

@espears4sq

While testing how well the LogisticGlm model scales on a large toy data set, I hit out-of-memory errors on my local machine (16 GB RAM) even for fairly small problem sizes.

Here is some example code to make a toy logistic regression:

import breeze.linalg.{DenseMatrix, DenseVector}
import scalaglm.{Glm, LogisticGlm}

object glm extends App {

  // Helper function to map synthetically generated data into
  // training labels of a logistic regression.
  def logistic_fn(x: Double): Double = {
    1.0 / (1.0 + math.exp(-x))
  }

  def fit_logistic(): Glm = {
    // Parameters for creating synthetic data
    val r = new scala.util.Random(0)
    val normal = breeze.stats.distributions.Gaussian(0, 1)

    // Define problem size num_observations x num_features
    val num_observations = 1000000
    val num_features = 50

    val beta = DenseVector.rand(num_features) :* 5.0
    val names = for (i <- 1 to num_features) yield "var_%d".format(i)
    println("True coefficients:")
    println(beta(0 to 10))

    // Create synthetic logistic regression data set.
    val x = DenseMatrix.rand(num_observations, num_features, normal)
    x(::, 0) := 1.0
    val true_logits = x * beta
    val y = true_logits map logistic_fn map {p_i => (if (r.nextDouble < p_i) 1.0 else 0.0)}

    val t1 = System.nanoTime
    val g = Glm(y, x, names, LogisticGlm, addIntercept=false, its=1000)
    println("Elapsed %4.2f for training model".format((System.nanoTime - t1) / 1e9d))
    g
  }
}

With this problem size (1 million observations by 50 features), I immediately get an OOM error:

scala> val g = glm.fit_logistic()
True coefficients:
DenseVector(2.78135510778451, 3.6818164882958326, 3.4840289537745948, 4.912012391491977, 2.907467492064324, 0.7532367248769811, 4.496847165217405, 0.20064910613956877, 4.855909891445109, 0.6049146229107971, 4.8162668734131895)
Aug 02, 2018 11:03:48 AM com.github.fommil.jni.JniLoader liberalLoad
INFO: successfully loaded /var/folders/0p/vx1f5tn93z1dc8pzk21g5nx40000gn/T/jniloader2218725777137246063netlib-native_system-osx-x86_64.jnilib
java.lang.OutOfMemoryError: Java heap space
  at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:153)
  at scala.reflect.ManifestFactory$DoubleManifest.newArray(Manifest.scala:151)
  at breeze.linalg.DenseMatrix$.zeros(DenseMatrix.scala:345)
  at breeze.linalg.DenseMatrix$$anon$33.$anonfun$apply$2(DenseMatrix.scala:823)
  at breeze.linalg.DenseMatrix$$anon$33.$anonfun$apply$2$adapted(DenseMatrix.scala:820)
  at breeze.linalg.DenseMatrix$$anon$33$$Lambda$5324/324878705.apply(Unknown Source)
  at scala.collection.immutable.Range.foreach(Range.scala:156)
  at breeze.linalg.DenseMatrix$$anon$33.apply(DenseMatrix.scala:820)
  at breeze.linalg.DenseMatrix$$anon$33.apply(DenseMatrix.scala:817)
  at breeze.linalg.BroadcastedColumns$$anon$4.apply(BroadcastedColumns.scala:91)
  at breeze.linalg.BroadcastedColumns$$anon$4.apply(BroadcastedColumns.scala:89)
  at breeze.linalg.ImmutableNumericOps.$times(NumericOps.scala:149)
  at breeze.linalg.ImmutableNumericOps.$times$(NumericOps.scala:148)
  at breeze.linalg.BroadcastedColumns.$times(BroadcastedColumns.scala:30)
  at scalaglm.Irls$.IRLS(Glm.scala:243)
  at scalaglm.Glm.<init>(Glm.scala:87)
  at glm$.fit_logistic(glm.scala:30)
  ... 15 elided

This is a fairly small problem instance. If I generate the same data set with numpy and serialize it to a binary file on disk, it is less than 5 GB. Both statsmodels and scikit-learn in Python load this data and fit the model (even with the standard error calculations) without trouble.
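To put rough numbers on this, here is a back-of-the-envelope sketch. The object `Footprint` and the helper `denseBytes` are hypothetical names used only for illustration. The stack trace ends in `DenseMatrix.zeros` called from a broadcasted column multiply inside `Irls.IRLS`, which suggests (an inference from the trace, not something confirmed against the scala-glm source) that a fresh matrix the size of `x` is being allocated at that point. The JVM's maximum heap is also often well below physical RAM unless it is raised with `-Xmx`, so it is worth printing.

```scala
object Footprint {
  // Bytes needed for one dense rows-by-cols matrix of 8-byte doubles
  // (ignores the small, constant array header).
  def denseBytes(rows: Long, cols: Long): Long = rows * cols * 8L

  def main(args: Array[String]): Unit = {
    val numObservations = 1000000L
    val numFeatures = 50L

    val oneCopy = denseBytes(numObservations, numFeatures)
    println(f"One dense copy of x: ${oneCopy / 1e6}%.0f MB")

    // Maximum heap the running JVM is allowed to use.
    val maxHeap = Runtime.getRuntime.maxMemory
    println(f"JVM max heap: ${maxHeap / 1e6}%.0f MB")
    println(f"Dense copies of x that fit in the heap: ${maxHeap.toDouble / oneCopy}%.1f")
  }
}
```

One dense copy of the 1,000,000 x 50 matrix is 400 MB, so a handful of simultaneous temporaries of that size could exhaust a heap of a few GB even though the data itself is modest.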

What are the root causes of such unexpectedly high memory usage in scala-glm?

A secondary question is how to monitor convergence for data this large. I can increase the number of iterations, but model fitting gives no per-iteration feedback on whether the fit is converging.
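For illustration, this is the kind of per-iteration feedback I have in mind. It is a minimal, standalone sketch of Newton/IRLS for logistic regression in plain Scala, not scala-glm's implementation or API. `IrlsSketch`, `fit`, `solve`, and the `log` callback are all hypothetical names. The loop streams over rows, so each iteration allocates only a p x p matrix and a length-p vector rather than another n x p temporary.

```scala
object IrlsSketch {
  // Solve the small p x p system a * x = b by Gaussian elimination
  // with partial pivoting. Inputs are not modified.
  def solve(a: Array[Array[Double]], b: Array[Double]): Array[Double] = {
    val n = b.length
    val m = Array.tabulate(n)(i => a(i).clone :+ b(i)) // augmented matrix
    for (col <- 0 until n) {
      val pivot = (col until n).maxBy(r => math.abs(m(r)(col)))
      val tmp = m(col); m(col) = m(pivot); m(pivot) = tmp
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col to n) m(r)(c) -= f * m(col)(c)
      }
    }
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      var s = m(i)(n)
      for (c <- i + 1 until n) s -= m(i)(c) * x(c)
      x(i) = s / m(i)(i)
    }
    x
  }

  // Fit logistic regression coefficients, reporting progress through `log`.
  def fit(x: Array[Array[Double]], y: Array[Double], maxIts: Int = 50,
          tol: Double = 1e-8, log: String => Unit = println): Array[Double] = {
    val p = x(0).length
    var beta = new Array[Double](p)
    var it = 0
    var converged = false
    while (it < maxIts && !converged) {
      val hess = Array.ofDim[Double](p, p) // accumulates X' W X
      val grad = new Array[Double](p)      // accumulates X' (y - mu)
      var deviance = 0.0
      for (i <- x.indices) {
        val row = x(i)
        var eta = 0.0
        for (j <- 0 until p) eta += row(j) * beta(j)
        val mu = 1.0 / (1.0 + math.exp(-eta))
        val w = mu * (1.0 - mu)
        // Clamp mu so the logarithms stay finite.
        val muC = math.min(math.max(mu, 1e-12), 1.0 - 1e-12)
        deviance += -2.0 * (y(i) * math.log(muC) + (1.0 - y(i)) * math.log(1.0 - muC))
        for (j <- 0 until p) {
          grad(j) += row(j) * (y(i) - mu)
          for (k <- 0 until p) hess(j)(k) += w * row(j) * row(k)
        }
      }
      val step = solve(hess, grad) // Newton step
      val maxChange = step.map(v => math.abs(v)).max
      beta = beta.zip(step).map { case (b, s) => b + s }
      it += 1
      // Deviance is evaluated at the coefficients before this step.
      log(f"iteration $it%d: deviance = $deviance%.4f, max |delta beta| = $maxChange%.3e")
      converged = maxChange < tol
    }
    beta
  }
}
```

Calling `IrlsSketch.fit(rows, labels)` prints one line per iteration. A deviance that stops decreasing and a coefficient change that shrinks toward zero indicate convergence. The sketch has no safeguards for perfectly separable data or a singular Hessian.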
