Replies: 2 comments
Hi, could anyone help?
Short answer: TorchMetrics' built-in distributed sync only supports tensor states. Why: under the hood, synchronization uses `gather_all_tensors`, which relies on `torch.distributed.all_gather`, and those collectives only operate on tensors.

Workaround: encode strings as tensors:

```python
import torch
from torchmetrics import Metric
from torchmetrics.utilities import dim_zero_cat


class MetricWithStrings(Metric):
    def __init__(self, max_len=256, **kwargs):
        super().__init__(**kwargs)
        self.max_len = max_len
        # Store encoded strings as padded int tensors
        self.add_state("encoded_strs", default=[], dist_reduce_fx="cat")

    def _encode(self, s: str) -> torch.Tensor:
        # Truncate to max_len and pad with 0 (NUL), so strings
        # must not contain chr(0) themselves.
        encoded = torch.zeros(self.max_len, dtype=torch.long)
        chars = torch.tensor([ord(c) for c in s[:self.max_len]], dtype=torch.long)
        encoded[:len(chars)] = chars
        return encoded

    def _decode(self, t: torch.Tensor) -> str:
        # Drop the zero padding on the way back out.
        return "".join(chr(c) for c in t.tolist() if c != 0)

    def update(self, strings: list[str]) -> None:
        for s in strings:
            self.encoded_strs.append(self._encode(s).unsqueeze(0))

    def compute(self) -> list[str]:
        all_encoded = dim_zero_cat(self.encoded_strs)  # (N, max_len)
        return [self._decode(row) for row in all_encoded]
```

Alternative: gather manually in `compute()`. If you'd rather keep raw Python lists and sync them yourself:

```python
import pickle

import torch
import torch.distributed as dist


def gather_strings(local_strings: list[str]) -> list[str]:
    # Serialize the local list, then exchange byte lengths so every
    # rank can pad to a common size before all_gather.
    data = pickle.dumps(local_strings)
    tensor = torch.tensor(list(data), dtype=torch.uint8, device="cuda")
    size = torch.tensor([len(data)], device="cuda")
    sizes = [torch.zeros_like(size) for _ in range(dist.get_world_size())]
    dist.all_gather(sizes, size)
    max_size = max(s.item() for s in sizes)
    # all_gather requires identically shaped tensors on every rank.
    padded = torch.zeros(max_size, dtype=torch.uint8, device="cuda")
    padded[:len(data)] = tensor
    gathered = [torch.zeros_like(padded) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, padded)
    result = []
    for g, s in zip(gathered, sizes):
        # Trim each blob back to its real size before unpickling.
        result.extend(pickle.loads(bytes(g[:s.item()].cpu().tolist())))
    return result
```

Then call this inside your metric's `compute()`.
I am trying to write a custom metric that maintains some state that is a `List[str]`. I want to be able to sync across ranks and concatenate the lists belonging to each rank. Reading through `sync_dist`, it's unclear to me where such a synchronization would occur, since the function applied by default is `gather_all_tensors` and there wouldn't be any tensors in the lists.

Is my understanding correct? Is there a different `dist_sync_fn` I could use to ensure correct syncing of non-tensor lists?