
Conversation

@NeoLegends commented Sep 19, 2025

Closes #1701

Currently fails due to #1763.

@NeoLegends changed the title to "Add multiprocessing support to PostprocessingDataset" Sep 19, 2025
@albertz commented Sep 19, 2025

Let's discuss the implementation first before you implement sth. See my comment in #1701. I think we can simply share/reuse the MultiProcDataset code?

@albertz commented Sep 19, 2025

We should also fix #1766 first.

@NeoLegends commented Sep 19, 2025

I understand it's not yet clear if it's safe to always copy the raw tensor in Tensor __getstate__.

So, for the purposes of this PR, where I want to send TensorDicts across a pickled pipe, do you think it's reasonable to introduce such a scope/contextmanager (as already mentioned in #1541) that configures whether the raw_tensor is included in __getstate__ or not?

With the contextmanager we do not need to decide now whether it's always safe to copy. If it eventually turns out it's always safe to copy the raw_tensor, we can remove the contextmanager implementation again from RETURNN and default to always copying.
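
For illustration, a minimal sketch of such a contextmanager, assuming a module-level flag that Tensor.__getstate__ would consult; all names here are hypothetical, not existing RETURNN API:

import contextlib

_serialize_raw_tensor = False  # hypothetical flag checked by Tensor.__getstate__

@contextlib.contextmanager
def serialize_raw_tensor():
    # Within this scope, Tensor.__getstate__ would include the raw_tensor.
    global _serialize_raw_tensor
    prev = _serialize_raw_tensor
    _serialize_raw_tensor = True
    try:
        yield
    finally:
        _serialize_raw_tensor = prev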

@albertz commented Sep 19, 2025

I think we should just fix #1541. I.e. first thing is to understand whether it is maybe safe to just always copy it. Or understand when it is not safe. I would avoid introducing some code complexity (like such a context scope) just because we want to avoid understanding the code.

assert self._dataset is not None
self._dataset.init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
self._data_iter = enumerate(self._build_mapping_iter())
if self._num_workers > 0:
@NeoLegends (review comment on the snippet above)
I wonder if it makes sense to use a dedicated subclass for the multi-process variant, to keep the base implementation simpler and avoid mixing in concerns of process management etc.

WDYT @albertz ?

@NeoLegends

I like this, the result has come out really cleanly separated.

@NeoLegends commented Sep 22, 2025

@albertz see here for why I made a subclass instead. I did not like how the multiprocessing details bled into the existing, straightforward implementation and made it less straightforward. There were class fields that were only used in one of the two cases, the data iterator would dynamically be either one or the other variant, etc.
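
Roughly the shape of the split, purely schematic (the subclass name and method bodies are illustrative, not the actual PR code):

class PostprocessingDataset(CachedDataset2):
    # Single-process variant: plain mapping pipeline, no process management.

    def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
        super().init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
        assert self._dataset is not None
        self._dataset.init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
        self._data_iter = enumerate(self._build_mapping_iter())
        return True


class MultiProcPostprocessingDataset(PostprocessingDataset):
    # Hypothetical subclass: owns the worker processes and overrides only what
    # differs, so process management never leaks into the base class.

    def init_seq_order(self, epoch=None, seq_list=None, seq_order=None):
        # spawn/reuse workers, distribute the seq order to them, etc.
        ...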

@NeoLegends commented Sep 29, 2025

It seems like manually calling gc.collect() after every 1000 segments (just an example number I picked; it must not be too low, or the frequent GC passes slow down the processor too much) keeps the resident size at ~2GB. So it is a GC issue in a way.

I guess from this finding it's sensible to add a gc.collect() to the other subprocess datasets as well.
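
As a sketch, the kind of periodic collection meant here (the loop and interval are illustrative, not the actual dataset code):

import gc

GC_INTERVAL = 1000  # example value; too low, and the extra GC passes slow the worker down

def worker_loop(seq_iter, out_pipe):
    # Hypothetical worker loop: forward seqs, and collect cyclic garbage
    # periodically so the resident size stays bounded.
    for seq_idx, seq in enumerate(seq_iter):
        out_pipe.send(seq)
        if (seq_idx + 1) % GC_INTERVAL == 0:
            gc.collect()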

Flamegraph WIP

@NeoLegends commented Sep 29, 2025

Flamegraph w/ GC every 2000 seqs: [flamegraph image]

Memory usage is half of what it was before. Is the Python GC even aware of memory pressure? If not, then I think we should add something like this to every dataset.

mem-2767946.html

@albertz commented Sep 29, 2025

Even 2GB is already much more than I would expect. A Python interpreter with Numpy imported takes rss=36.6MB pss=36.1MB uss=36.0MB shared=596.0KB for me.
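
(Numbers like these can be read off with psutil, assuming Linux, where uss/pss are derived from /proc/<pid>/smaps:)

import numpy  # noqa: F401  # imported so that numpy's footprint is included
import psutil

proc = psutil.Process()
mi = proc.memory_info()        # rss, shared, ... (Linux)
mfi = proc.memory_full_info()  # additionally uss, pss, swap
print(
    f"rss={mi.rss / 2**20:.1f}MB pss={mfi.pss / 2**20:.1f}MB "
    f"uss={mfi.uss / 2**20:.1f}MB shared={mi.shared / 2**10:.1f}KB"
)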

@NeoLegends commented Sep 29, 2025

Ah, I've been running with buf_size=200 so far, so some memory usage is expected. With buf_size=1 and frequent GC the memory usage is at about 400MB.

@albertz commented Sep 29, 2025

Ah, I've been running with buf_size=200 so far, so some memory usage is expected. With buf_size=1 and frequent GC the memory usage is at about 400MB.

Buffer size 200 seems way too much. I would expect sth like 1-10 is enough here?

@albertz commented Sep 29, 2025

With buf_size=1 and frequent GC the memory usage is at about 400MB.

This seems still a bit too much to me. Where exactly is it consumed? As said, I would expect more like 40MB.

@NeoLegends

Buffer size 200 seems way too much. I would expect sth like 1-10 is enough here?

Yeah, maybe. I tend to use larger buffers to make sure the data pipeline never stalls.

This seems still a bit too much to me. Where exactly is it consumed? As said, I would expect more like 40MB.

Hm, maybe. But then I also don't think it's that problematic anymore; maybe digging even deeper into where this memory is spent is "off topic" for this PR? The 400MB is the resident size, which, as seen, is about 2x the heap size here.

Here is the flamegraph:
[flamegraph image]

mem-2769354.html

@NeoLegends

I'm now looking into the allocation counts; they seem suspiciously high to me. In the 10 minutes of profiling time the process made about 930M allocations, many of which are made by the unpickler. Maybe we have an issue w/ unpickling?

@albertz commented Sep 29, 2025

I'm not sure this is off topic for the PR. For that, we first need to understand where it comes from.

From your flamegraph, I don't really get where the mem is allocated: to what objects, where they were created, etc.

@NeoLegends commented Sep 29, 2025

❯ memray stats -n 20 mem-2769354.profile
📏 Total allocations:
        466772172

📦 Total memory allocated:
        120.993GB

📈 Peak memory usage:
        175.886MB

📊 Histogram of allocation size:
        min: 1.000B
        ------------------------------------------------
        < 5.000B   :    759459 ▇
        < 26.000B  :  67449044 ▇▇▇▇▇
        < 136.000B : 395472412 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇
        < 705.000B :   1863909 ▇
        < 3.636kB  :    379359 ▇
        < 18.739kB :    376007 ▇
        < 96.580kB :    247423 ▇
        < 497.753kB:    179849 ▇
        < 2.565MB  :     44196 ▇
        <=13.221MB :       514 ▇
        ------------------------------------------------
        max: 13.221MB

📂 Allocator type distribution:
         PYMALLOC_MALLOC: 371663783
         PYMALLOC_REALLOC: 94496982
         PYMALLOC_CALLOC: 501135
         MALLOC: 109923
         CALLOC: 349

🥇 Top 20 largest allocating locations (by size):
        - recv:/usr/lib/python3.10/multiprocessing/connection.py:251 -> 41.959GB
        - dumps:/usr/lib/python3.10/multiprocessing/reduction.py:51 -> 28.005GB
        - resample:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/resampy/core.py:140 -> 10.172GB
        - _recv:/usr/lib/python3.10/multiprocessing/connection.py:379 -> 8.352GB
        - _recv:/usr/lib/python3.10/multiprocessing/connection.py:386 -> 7.101GB
        - get_audio_features:/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/datasets/util/feature_extraction.py:190 -> 3.458GB
        - _extract_features:/home/mgunz/setups/2025-03-07--ws-asr/recipe/apptek_asr/lib/data_postprocessing.py:374 -> 3.458GB
        - get_audio_features:/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/datasets/util/feature_extraction.py:168 -> 3.457GB
        - _apply_speed_perturbation:/home/mgunz/setups/2025-03-07--ws-asr/recipe/apptek_asr/lib/data_postprocessing.py:278 -> 2.762GB
        - _apply_speed_perturbation:/home/mgunz/setups/2025-03-07--ws-asr/recipe/apptek_asr/lib/data_postprocessing.py:283 -> 2.416GB
        - zeros_like:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numpy/core/numeric.py:129 -> 2.414GB
        - _pad_simple:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numpy/lib/arraypad.py:114 -> 1.721GB
        - __call__:/home/mgunz/setups/2025-03-07--ws-asr/recipe/apptek_asr/lib/data_postprocessing.py:177 -> 1.401GB
        - diff:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numpy/lib/function_base.py:1452 -> 772.242MB
        - diff:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numpy/lib/function_base.py:1441 -> 771.573MB
        - valid_audio:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/librosa/util/utils.py:313 -> 691.183MB
        - resample:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/resampy/core.py:134 -> 576.544MB
        - _tokenize:/usr/lib/python3.10/tokenize.py:527 -> 373.258MB
        - _wrapreduction:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numpy/core/fromnumeric.py:88 -> 330.813MB
        - __init__:/usr/lib/python3.10/multiprocessing/reduction.py:39 -> 84.045MB

🥇 Top 20 largest allocating locations (by number of allocations):
        - recv:/usr/lib/python3.10/multiprocessing/connection.py:251 -> 386929311
        - dumps:/usr/lib/python3.10/multiprocessing/reduction.py:51 -> 68922730
        - _tokenize:/usr/lib/python3.10/tokenize.py:527 -> 751792
        - _apply_speed_perturbation:/home/mgunz/setups/2025-03-07--ws-asr/recipe/apptek_asr/lib/data_postprocessing.py:271 -> 383424
        - __init__:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numba/core/types/common.py:50 -> 375888
        - <lambda>:<string>:1 -> 289113
        - __call__:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numba/core/types/abstract.py:67 -> 234946
        - ufunc_can_cast:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numba/np/numpy_support.py:354 -> 219268
        - __reduce_ex__:/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/tensor/_dim_extra.py:332 -> 200626
        - _tokenize:/usr/lib/python3.10/tokenize.py:591 -> 187944
        - _tokenize:/usr/lib/python3.10/tokenize.py:529 -> 187944
        - __init__:/usr/lib/python3.10/multiprocessing/reduction.py:39 -> 186956
        - _name_get:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numpy/core/_dtype.py:363 -> 183308
        - issubclass_:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numpy/core/numerictypes.py:320 -> 154470
        - _tokenize:/usr/lib/python3.10/tokenize.py:600 -> 148789
        - __setstate__:/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/tensor/_tensor_extra.py:600 -> 147466
        - shape:/home/mgunz/setups/2025-03-07--ws-asr/recipe/returnn/returnn/tensor/_tensor_extra.py:1689 -> 143280
        - parent:<frozen importlib._bootstrap>:408 -> 140966
        - __init__:/home/mgunz/src/venv/pythonenv/lib/python3.10/site-packages/numba/core/types/npytypes.py:460 -> 140966
        - __init__:<frozen importlib._bootstrap>:72 -> 140952

Why would the unpickler need to make 380 million allocations over these 10min if all we're sending is TensorDicts? One subepoch takes longer than 10min and uses about 1.5k steps. That is at most 300k seqs, if every step used the full number of seqs (200 in my case). But it doesn't, since I use laplace ordering, so the real number of seqs will be much lower. Still, that would be ~1200 allocations per seq, which is way more than I would expect. And the statistics show that the majority of allocations fall between 26 and 136 bytes, so they are also really small? There are still about 45k allocations between 500KB and 2.5MB, so maybe the numpy arrays for the audio data etc. fall into that range. Still, this high allocation count is very strange to me.

Unfortunately I don't get more details on the precise objects that pickle is allocating, that's all I see for now.

@albertz commented Sep 29, 2025

Ok, 1200 allocs per seq doesn't seem like too much to me. Every single object counts here. Even a simple attribute access will allocate a new temporary method wrapper object.

@NeoLegends commented Sep 29, 2025

https://stackoverflow.com/a/25334520/2000501

Oh, wow. One class instance (without slots) apparently costs at least ~250 bytes. I expected Python to use some memory, but did not expect it to be that wasteful. But ok.

@NeoLegends commented Sep 29, 2025

I guess we're done here then. No further digging required.

Maybe we can be more efficient at this in general by adding __slots__ to the RETURNN tensor classes (Dim, Tensor) if they don't have it yet. But probably you optimized this already, or have a good reason why we should not do this.
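
As a quick illustration of that per-instance overhead (CPython; the exact numbers vary by version):

import sys

class Plain:
    def __init__(self):
        self.a, self.b = 1, 2

class Slotted:
    __slots__ = ("a", "b")
    def __init__(self):
        self.a, self.b = 1, 2

p, s = Plain(), Slotted()
# A plain instance pays for the object header plus a per-instance __dict__.
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # e.g. 48 + 104 bytes on CPython 3.10
print(sys.getsizeof(s))  # e.g. 48 bytes; attributes live in fixed slots, no __dict__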

@NeoLegends

Wrt. the gc.collect interval parameter, I am inclined to also add a lower bound of e.g. 100 seqs to the value. The reason is that with buf_size=1 you would otherwise end up in the pathological case where you gc.collect() after every seq, which is going to be a massive slowdown. I want to make this reasonably easy for the user, so that they don't have to think much about this config knob at all.
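
A sketch of the clamping I mean, assuming the collection interval defaults to buf_size (names are illustrative):

MIN_GC_INTERVAL = 100  # lower bound, so buf_size=1 does not trigger a collect after every seq

def effective_gc_interval(buf_size: int) -> int:
    # Hypothetical helper: derive the gc.collect() interval from buf_size,
    # but never collect more often than every MIN_GC_INTERVAL seqs.
    return max(buf_size, MIN_GC_INTERVAL)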

@albertz commented Sep 29, 2025

Well, I still don't understand where it consumes 400MB. ~1200 allocations per seq doesn't really explain that, and most of these objects are temporary and immediately deallocated again (no need for GC). Also, for Tensor and Dim, we already use __slots__. But that is not really relevant for the discussion of mem consumption: it makes a few bytes difference (per seq), but does not explain 400MB.

Your log output still does not really show what I want to know. I don't really want to know the accumulated amount of allocations. I want to know, at some/any particular point, what the allocated memory is: what objects, where they were allocated, etc. E.g. in the flamegraph, when you see it has 2GB (or 400MB) allocated, what is this 400MB?

@NeoLegends

Yes, this is true. I haven't been able to dig out this information from the profiles yet. But it's probably there. Will find that out.

@NeoLegends commented Oct 10, 2025

Setup:

DistributeFilesDataset(
  PostprocessingDataset(
    MetaDataset(HDFDataset(Audio), HDFDataset(Transcription)),
    buf_size=1,
    num_workers=2,
  ),
  buf_size=800,
)

No explicit GC calls/tuning.

By count it's mostly strs that are allocated. This is from pympler, but I think I need to use an even more advanced profiler to find out where these strings come from.
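
(For reference, summaries and the new/gone diff below can be produced with standard pympler calls along these lines; where exactly to snapshot is of course setup-specific:)

from pympler import muppy, summary

# Snapshot all live objects, grouped per type.
before = summary.summarize(muppy.get_objects())
# ... let the worker load another 100 seqs ...
after = summary.summarize(muppy.get_objects())
summary.print_(after)                            # per-type table like the ones below
summary.print_(summary.get_diff(before, after))  # the "new" objects since `before`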

worker 0: memory usage: 54.6 MiB, after 10 loaded seqs
                                        types |   # objects |   total size
============================================= | =========== | ============
                                          str |      107202 |     17.45 MB
                                         dict |       44352 |     12.31 MB
                                         code |       27693 |      4.72 MB
                                         type |        3770 |      3.50 MB
                                  abc.ABCMeta |        2021 |      2.06 MB
                                        tuple |       34968 |      1.99 MB
                                         list |        5245 |    965.52 KB
                        weakref.ReferenceType |       13006 |    914.48 KB
                                          set |        1479 |    869.48 KB
                                numpy.ndarray |         146 |    602.02 KB
                                    frozenset |        2344 |    548.69 KB
                   builtin_function_or_method |        7160 |    503.44 KB
                                          int |        9057 |    273.95 KB
                          function (__init__) |        1774 |    249.47 KB
                                         cell |        5893 |    230.20 KB
                            getset_descriptor |        3401 |    212.56 KB
                           wrapper_descriptor |        2872 |    201.94 KB
                                 staticmethod |        3463 |    162.33 KB
                            method_descriptor |        2187 |    153.77 KB
                       function (on_disposal) |        1044 |    146.81 KB
     numba.core.types.abstract._TypeMetaclass |         138 |    144.97 KB
                               _abc._abc_data |        2191 |    136.94 KB
                                       method |        1870 |    116.88 KB
     _cython_3_0_10.cython_function_or_method |         571 |    115.98 KB
                      collections.OrderedDict |         228 |    113.77 KB
                                  numpy.ufunc |         465 |    112.62 KB
         _cython_3_0_10.fused_cython_function |         497 |    108.72 KB
                  numba.core.utils.UniqueDict |          19 |     82.39 KB
                          function (__repr__) |         571 |     80.30 KB
                                     property |        1011 |     78.98 KB
          numba.core.types.functions.Function |        1658 |     77.72 KB
                      collections.defaultdict |          47 |     68.93 KB
                                      StgDict |         120 |     68.56 KB
                                   re.Pattern |         137 |     66.29 KB
         numba.core.cpu_options.InlineOptions |        1358 |     63.66 KB
                 _frozen_importlib.ModuleSpec |        1337 |     62.67 KB
                                enum.EnumMeta |          53 |     55.16 KB
  _frozen_importlib_external.SourceFileLoader |        1143 |     53.58 KB
                            member_descriptor |         810 |     50.62 KB
       ctypes.CDLL.__init__.<locals>._FuncPtr |         255 |     47.81 KB
                                        float |        1999 |     46.85 KB
                            inspect.Parameter |         731 |     45.69 KB
                           function (ov_wrap) |         319 |     44.86 KB
                          function (<lambda>) |         292 |     41.06 KB
                       _ctypes.PyCPointerType |          39 |     40.18 KB
                        _ctypes.PyCStructType |          37 |     38.09 KB
                           function (__new__) |         265 |     37.27 KB
                          function (__call__) |         246 |     34.59 KB
                    _collections._tuplegetter |         730 |     34.22 KB
                       function (method_impl) |         243 |     34.17 KB
after 110 loaded seqs:
                                        types |   # objects |   total size
============================================= | =========== | ============
                                          str |      137800 |     20.27 MB
                                         dict |       44398 |     13.16 MB
                                         code |       27693 |      4.73 MB
                                         list |        5256 |      3.99 MB
                                         type |        3783 |      3.51 MB
                                  abc.ABCMeta |        2021 |      2.06 MB
                                        tuple |       34977 |      1.99 MB
                                numpy.ndarray |         167 |      1.19 MB
                        weakref.ReferenceType |       13019 |    915.40 KB
                                          set |        1482 |    870.11 KB
                                    frozenset |        2344 |    548.69 KB
                   builtin_function_or_method |        7160 |    503.44 KB
                                          int |        9079 |    274.60 KB
                          function (__init__) |        1774 |    249.47 KB
                                         cell |        5893 |    230.20 KB
                            getset_descriptor |        3401 |    212.56 KB
                           wrapper_descriptor |        2898 |    203.77 KB
                                 staticmethod |        3463 |    162.33 KB
                            method_descriptor |        2187 |    153.77 KB
                       function (on_disposal) |        1044 |    146.81 KB
     numba.core.types.abstract._TypeMetaclass |         138 |    144.97 KB
                               _abc._abc_data |        2191 |    136.94 KB
                                       method |        1870 |    116.88 KB
     _cython_3_0_10.cython_function_or_method |         571 |    115.98 KB
                      collections.OrderedDict |         228 |    113.77 KB
                                  numpy.ufunc |         465 |    112.62 KB
         _cython_3_0_10.fused_cython_function |         497 |    108.72 KB
                  numba.core.utils.UniqueDict |          19 |     82.39 KB
                          function (__repr__) |         571 |     80.30 KB
                                     property |        1011 |     78.98 KB
          numba.core.types.functions.Function |        1658 |     77.72 KB
                      collections.defaultdict |          47 |     68.93 KB
                                      StgDict |         120 |     68.56 KB
                                   re.Pattern |         137 |     66.29 KB
         numba.core.cpu_options.InlineOptions |        1358 |     63.66 KB
                 _frozen_importlib.ModuleSpec |        1337 |     62.67 KB
                                enum.EnumMeta |          53 |     55.16 KB
  _frozen_importlib_external.SourceFileLoader |        1143 |     53.58 KB
                            member_descriptor |         810 |     50.62 KB
       ctypes.CDLL.__init__.<locals>._FuncPtr |         255 |     47.81 KB
                                        float |        1999 |     46.85 KB
                            inspect.Parameter |         731 |     45.69 KB
                           function (ov_wrap) |         319 |     44.86 KB
                          function (<lambda>) |         292 |     41.06 KB
                       _ctypes.PyCPointerType |          39 |     40.18 KB
                        _ctypes.PyCStructType |          37 |     38.09 KB
                           function (__new__) |         265 |     37.27 KB
                          function (__call__) |         246 |     34.59 KB
                    _collections._tuplegetter |         730 |     34.22 KB
                       function (method_impl) |         243 |     34.17 KB

new
                                        types |   # objects |   total size
============================================= | =========== | ============
                                         list |          11 |      3.05 MB
                                          str |       30598 |      2.63 MB
                                numpy.ndarray |          21 |      2.05 MB
                                         dict |          46 |    873.06 KB
                                         type |          13 |      5.18 KB
                           wrapper_descriptor |          26 |      1.83 KB
                 returnn.tensor.tensor.Tensor |          15 |      1.64 KB
                       returnn.tensor.dim.Dim |          12 |      1.12 KB
                        weakref.ReferenceType |          13 |    936     B
                                          int |          22 |    668     B
                                          set |           3 |    648     B
                                        tuple |           9 |    480     B
        returnn.tensor.tensor_dict.TensorDict |           3 |    144     B
          returnn.tensor._dim_extra._DimExtra |           3 |    144     B
  returnn.datasets.util.vocabulary.Vocabulary |           3 |    144     B
                     functools._lru_list_elem |           1 |     56     B
gone
  types |   # objects |   total size
======= | =========== | ============

@albertz commented Oct 10, 2025

By count it's mostly strs that are allocated. This is from pympler, but I think I need to use an even more advanced profiler to find out where these strings come from.

But if you sum those allocations together, this is much less than 2GB? (Also less than 400MB?) Does it really count everything? Or is this now a different setup? Is this then still a relevant setup?

@albertz commented Oct 10, 2025

I am assuming the wrapped HDF/MetaDataset for now, but it would be good to have proof.

What memory are you measuring? I thought you measure only the subproc of PostprocessingDataset? There should be no HDF/MetaDataset in there?

@NeoLegends commented Oct 10, 2025

But if you sum those allocations together, this is much less than 2GB? (Also less than 400MB?) Does it really count everything? Or is this now a different setup? Is this then still a relevant setup?

The difference is just in the timing of the measurements. I use buf_size=1 in the worker, which eventually results in 400MB peak RSS usage, but only after leaving it running for a while (5min or so). Here I am taking two measurements that are just 100 seqs apart from each other, so just a few seconds. I am doing this because if I measure across more segments, the diff process between the two measurements takes too long (more than 10min) because there are so many new objects. I think what's important here is that there are 30k new str objects that are not yet collected, even across just 100 seqs.

What memory are you measuring? I thought you measure only the subproc of PostprocessingDataset? There should be no HDF/MetaDataset in there?

Ah, of course, yes. I am measuring only the postprocessor worker there. I mixed things up in my head when I wrote this. ...still feeling a bit under the weather today.
