
progress indicator? #81

@AlJohri

@owlas @arokem I'm running fci.random_forest_error on a fairly large dataset.

train: 3334431, 200
test: 13703350, 200

(train is smaller after undersampling)

I'm trying to use both the memory_constrained version and the low_memory version (#74).

I ran the memory_constrained version on an m5.24xlarge EC2 instance with 384 GiB of memory and 96 vCPUs. I gave it a memory_limit of 100000 MB (100 GB). This only utilized about half of the memory on the instance, and it ran for over 48 hours until I just terminated the instance.

I'm currently running the low_memory option on an m5.12xlarge (192 GiB memory, 48 vCPUs), which has been running for 15 hours straight and hasn't finished yet. Using top, I can see that all of the CPUs are being utilized at 100%.
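
For reference, the memory_constrained run was invoked roughly like this (a minimal sketch; `forest`, `X_train`, and `X_test` stand in for my actual estimator and data, and keyword names may differ slightly between forestci versions):

```python
import forestci as fci

# memory-constrained path: memory_limit is given in megabytes
V_IJ_unbiased = fci.random_forest_error(
    forest, X_train, X_test,
    memory_constrained=True,
    memory_limit=100_000,  # 100 GB, the value used on the m5.24xlarge
)
```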

I have a few questions:

  1. How can I estimate the ideal size of memory_limit? I understand from the docs that it's the maximum size of the intermediate matrices, but it wasn't clear to me how many intermediate matrices are created at a time. Is it sequential, i.e. do I just give it the whole RAM? (A rough back-of-the-envelope estimate is sketched after this list.)

  2. Is the memory_constrained version faster than the low_memory option, given a large enough memory limit? It wasn't clear to me which of the two I should expect to complete.

  3. Is there any way to show a progress indicator (even if I have to hack in a print for now)? I'd like to know how close I am to completion with jobs that seem to take multiple days to run.

  4. Overall, I'm looking to precompute as much as possible and then run this model on live predictions one at a time. Looking at the code, I believe this should be possible; does this sound doable to you?
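
To make questions 1, 3, and 4 more concrete, here is a rough sketch of what I have in mind: precompute the in-bag counts once, chunk the test set by hand, and wrap the loop in a progress bar. `forest`, `X_train`, and `X_test` are placeholders for my estimator and data, the chunk-size arithmetic assumes the dominant intermediate is an (n_train x n_test_chunk) float64 array, and `calc_inbag` / the `inbag=` keyword should be checked against the installed forestci version:

```python
import numpy as np
import forestci as fci
from tqdm import tqdm  # progress bar; a plain print inside the loop works too

# 1. Back-of-the-envelope for memory_limit, using the sizes from this issue.
#    Assumption: the big intermediate is roughly an (n_train x n_test_chunk)
#    float64 array, so each test row costs about n_train * 8 bytes.
n_train, n_test = 3_334_431, 13_703_350
memory_limit_mb = 100_000  # the 100 GB limit from the run above
rows_per_chunk = int(memory_limit_mb * 1e6 / (8 * n_train))
print(rows_per_chunk, n_test // rows_per_chunk)  # ~3,700 rows/chunk -> ~3,700 chunks

# 2. Precompute the in-bag counts once so they can be reused across calls
#    (and later for one-at-a-time live predictions).
inbag = fci.calc_inbag(X_train.shape[0], forest)

# 3. Chunk the test set by hand to get a progress indicator.
#    Caveat: calibrate=True calibrates per call, so per-chunk results can
#    differ slightly from a single big call; pass calibrate=False if that matters.
variances = []
for start in tqdm(range(0, X_test.shape[0], rows_per_chunk)):
    chunk = X_test[start:start + rows_per_chunk]
    variances.append(
        fci.random_forest_error(forest, X_train, chunk, inbag=inbag)
    )
V_IJ_unbiased = np.concatenate(variances)
```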

Any other advice you can offer would be most helpful.

Thanks!

EDIT: I'm trying the memory_constrained option again with a memory_limit of 300000 (300 GB). That limit does indeed appear to be applied sequentially: the memory slowly crawls up to the max, all the cores kick in for a few minutes, and then the memory comes back down again.

Notably, while the memory is slowly filling up, only one core is being used. Only when the memory fills up all the way do most of the cores kick into action, for about 2 minutes. Then it takes about a minute for the memory to go back down, again with a single core in use. The cycle then repeats.

Is there perhaps some optimization that would allow all the cores to be used more efficiently? At the moment, the majority of the time seems to be spent filling memory up to the limit on a single core, with the actual multi-core computation happening only briefly after that.

As mentioned before, the low_memory option, on the other hand, is constantly using all cores at 100%.
