Batch backend OOMs when collecting large numbers of partitions

### What happened?

In the `ServiceBackend`, we submit a job group and wait for all jobs to complete before gathering partition results. We then use a `ThreadPoolExecutor` to read up to `MaxAvailableGcsConnections` partition results at a time:

https://github.com/hail-is/hail/blob/99ae33b97f47084a1739021d221d82c8d7437150/hail/hail/src/is/hail/backend/service/ServiceBackend.scala#L293-L306

That value is currently defined at

https://github.com/hail-is/hail/blob/99ae33b97f47084a1739021d221d82c8d7437150/hail/hail/src/is/hail/backend/service/ServiceBackend.scala#L35

and triggers the following memory profile:

<img width="400" height="244" alt="Image" src="https://github.com/user-attachments/assets/d35bb973-d334-437c-8e58-4838105722f3" />
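
For readers who don't want to follow the permalinks, the pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual `ServiceBackend` code: `BoundedPartitionReads`, `MaxConcurrentReads`, `readPartition` and `readAll` are hypothetical names, and the real value of `MaxAvailableGcsConnections` is not reproduced here.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object BoundedPartitionReads {
  // Hypothetical stand-in for MaxAvailableGcsConnections (assumed value).
  val MaxConcurrentReads: Int = 4

  // Hypothetical reader: pretends each partition output is a small byte array.
  // A negative index simulates a failed partition.
  def readPartition(index: Int): Either[Throwable, Array[Byte]] =
    if (index < 0) Left(new IllegalArgumentException(s"bad partition index $index"))
    else Right(Array.fill(8)(index.toByte))

  def readAll(indices: Seq[Int]): (Option[Throwable], Seq[Array[Byte]]) = {
    // The fixed-size pool caps how many reads run concurrently...
    val pool = Executors.newFixedThreadPool(MaxConcurrentReads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val reads = Future.traverse(indices)(i => Future(readPartition(i)))
      // ...but every result is retained until the last read has finished,
      // so the pool size does not cap the total memory held at this point.
      val (failures, successes) =
        Await.result(reads, Duration.Inf).partitionMap(identity)
      (failures.headOption, successes)
    } finally pool.shutdown()
  }
}
```

Calling `BoundedPartitionReads.readAll(0 until 10)` returns no failure and ten outputs. In this sketch the pool bounds only the number of in-flight reads; the collected outputs all live in memory together once `Await.result` returns, which may be relevant when the number of partitions is large.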

### Relevant Zulip thread

[#Hail Batch support > Combiner Driver OOM on 0.2.137](https://hail.zulipchat.com/#narrow/channel/223457-Hail-Batch-support/topic/Combiner.20Driver.20OOM.20on.200.2E2.2E137/with/574611843)

### Version

0.2.137


    jobGroup.state match {
      case Success =>
        val (failures, successes) =
          Await.result(readPartitionOutputs(todo.indices), Duration.Inf).partitionMap(
            identity
          )

        if (failures.nonEmpty)
          logger.error(
            f"Job group ${jobGroup.job_group_id} in batch ${jobGroup.batch_id} " +
              f"completed successfully yet found errors in partition outputs."
          )

        (failures.headOption, successes)

Batch backend OOMs when collecting large numbers of partitions #15288
