Understanding the behaviour of Modin #6226
Replies: 1 comment 2 replies
Hi @overseek944!
We read each part of the csv file into a temporary buffer, and then pass that buffer as input to the read function of pandas itself. That is, at the peak moment, we can have on each process both a buffer and a pandas dataframe created from it, which can be roughly estimated as 2 times more memory. This is also true for reading json files if they are created with
I believe that in this case the reason is only a lack of RAM. Therefore, you need to either reduce the number of files that you use, or increase the amount of RAM on the machine. |
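The "reduce the number of files" suggestion can be sketched with plain pandas by loading files in small batches, so that the temporary buffers plus the resulting frames for one batch, rather than for all 5000+ files, are what set the peak. Everything here is hypothetical, not Modin's API: the `read_json_in_batches` helper, the batch size, and the assumption that the files are newline-delimited JSON (`lines=True`).

```python
import glob
import pandas as pd  # plain pandas, to keep the sketch dependency-free

def read_json_in_batches(pattern, batch_size=100):
    """Read JSON files matching `pattern` in groups of `batch_size`.

    Only one batch of per-file frames is alive at a time, so peak memory
    is bounded by roughly 2x the size of a single batch rather than 2x
    the size of the whole dataset.
    """
    paths = sorted(glob.glob(pattern))
    frames = []
    for i in range(0, len(paths), batch_size):
        # Assumes newline-delimited JSON; drop lines=True for array-style files.
        batch = [pd.read_json(p, lines=True) for p in paths[i:i + batch_size]]
        frames.append(pd.concat(batch, ignore_index=True))
    return pd.concat(frames, ignore_index=True)
```

Note that the final `concat` still needs RAM for the full combined frame, so this only helps when the bottleneck is the transient read overhead, not the size of the end result.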
Data details
Number of files - 5000+ JSON files
File size - ~70 MB each
Total size - ~350 GB
Instance details
Type - ml.m5.24xlarge
Memory - 384 GiB
vCPUs - 96
Hi Team,
I want to understand the potential reasons for this failure and how it can be fixed.
I am running this in a TrainingJob using SKLearnProcessor. It is not distributed, so I am using just a single ml.m5.24xlarge instance.
Reference -
https://stackoverflow.com/questions/76043804/ray-workers-being-killed-because-of-oom-pressure
I have gone through the above post, but I still do not understand how reading can take up to 2x memory overhead.
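A back-of-envelope estimate from the numbers above shows why the job runs out of memory under the ~2x read overhead described in the reply (temporary buffer plus the resulting DataFrame). All figures are taken from this thread; nothing here is measured:

```python
# Rough peak-memory estimate for reading all files at once,
# assuming ~2x overhead (read buffer + resulting pandas frame).
n_files = 5000
file_size_gb = 0.07                 # ~70 MB per file
data_gb = n_files * file_size_gb    # ~350 GB of raw JSON
overhead_factor = 2                 # buffer + DataFrame held simultaneously
peak_gb = data_gb * overhead_factor # ~700 GB at peak
ram_gb = 384                        # ml.m5.24xlarge memory

print(f"estimated peak: {peak_gb:.0f} GB vs {ram_gb} GB of RAM")
```

Even before any per-process duplication, ~700 GB of estimated peak usage comfortably exceeds the instance's 384 GiB, which matches the OOM kills seen in the linked Stack Overflow post.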