Skip to content

Batch backend OOMs when collecting large numbers of partitions #15288

@ehigham

Description

@ehigham

What happened?

In the ServiceBackend, we submit a job group and wait for all jobs to complete before gathering partition results. We use a ThreadPoolExecutor to read MaxAvailableGcsConnections partition results at a time:

jobGroup.state match {
case Success =>
val (failures, successes) =
Await.result(readPartitionOutputs(todo.indices), Duration.Inf).partitionMap(
identity
)
if (failures.nonEmpty)
logger.error(
f"Job group ${jobGroup.job_group_id} in batch ${jobGroup.batch_id} " +
f"completed successfully yet found errors in partition outputs."
)
(failures.headOption, successes)

That value is currently defined at

val MaxAvailableGcsConnections = 1000

and triggers the following memory profile:

Image

Relevant zulip thread

#Hail Batch support > Combiner Driver OOM on 0.2.137

Version

0.2.137

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions