Description
What happened?
In the ServiceBackend, we submit a job group and wait for all of its jobs to complete before gathering partition results. We then use a ThreadPoolExecutor to read up to MaxAvailableGcsConnections partition results concurrently:
hail/hail/hail/src/is/hail/backend/service/ServiceBackend.scala
Lines 293 to 306 in 99ae33b
    jobGroup.state match {
      case Success =>
        val (failures, successes) =
          Await.result(readPartitionOutputs(todo.indices), Duration.Inf).partitionMap(
            identity
          )
        if (failures.nonEmpty)
          logger.error(
            f"Job group ${jobGroup.job_group_id} in batch ${jobGroup.batch_id} " +
              f"completed successfully yet found errors in partition outputs."
          )
        (failures.headOption, successes)
That value is currently defined at
    val MaxAvailableGcsConnections = 1000
and triggers the following memory profile:
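To illustrate the read pattern described above, here is a minimal sketch of running partition reads on a fixed-size thread pool. It is not the ServiceBackend implementation: `BoundedReads`, `readPartition`, `readAll`, and `maxConcurrent` are hypothetical names, and the read itself is a stand-in that returns a small in-memory buffer instead of contacting GCS. The sketch only shows that the pool size bounds how many reads are in flight at once, which is the role MaxAvailableGcsConnections plays in the description.

```scala
import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object BoundedReads {
  // Hypothetical stand-in for reading one partition's output from storage.
  def readPartition(i: Int): Array[Byte] = Array.fill(1024)(i.toByte)

  // Reads partitions 0 until n on a pool of `maxConcurrent` threads.
  // Returns the results in index order together with the largest number
  // of reads that were observed running at the same time.
  def readAll(n: Int, maxConcurrent: Int): (IndexedSeq[Array[Byte]], Int) = {
    val pool = Executors.newFixedThreadPool(maxConcurrent)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)

    // Counters guarded by `lock`; used only to observe peak concurrency.
    val lock = new Object
    var inFlight = 0
    var peak = 0

    try {
      val futures = (0 until n).map { i =>
        Future {
          lock.synchronized {
            inFlight += 1
            if (inFlight > peak) peak = inFlight
          }
          try readPartition(i)
          finally lock.synchronized { inFlight -= 1 }
        }
      }
      val results = Await.result(Future.sequence(futures), Duration.Inf)
      (results, lock.synchronized(peak))
    } finally pool.shutdown()
  }
}
```

With this shape, the pool size caps the number of concurrent reads, but every result is still held in memory until the whole sequence completes. Whether lowering the limit alone would change the memory profile reported here is not something this sketch establishes.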
Relevant Zulip thread
#Hail Batch support > Combiner Driver OOM on 0.2.137
Version
0.2.137