PostprocessingDataset: add support for multiprocessing #1765

base: master
Conversation
Let's discuss the implementation first before you implement something. See my comment in #1701. I think we can simply share/reuse the
We should also fix #1766 first.
I understand it's not yet clear if it's safe to always copy the raw tensor in Tensor. So, for the purposes of this PR, where I want to send TensorDicts across a pickled pipe, do you think it's reasonable to introduce such a scope/contextmanager (as already mentioned in #1541) that configures whether the copy happens? With the contextmanager we do not need to know now whether it's always safe to copy. If it then eventually turns out it's always safe to copy the raw tensor, the contextmanager can be dropped again.
I think we should just fix #1541. I.e., the first thing is to understand whether it is maybe safe to just always copy it, or to understand when it is not safe. I would avoid introducing code complexity (like such a context scope) just because we want to avoid understanding the code.
Force-pushed from 39d6521 to b700514
assert self._dataset is not None
self._dataset.init_seq_order(epoch=epoch, seq_list=seq_list, seq_order=seq_order)
self._data_iter = enumerate(self._build_mapping_iter())
if self._num_workers > 0:
I wonder if it makes sense to use a dedicated subclass for the multi-process variant, to keep the base implementation simpler and avoid mixing in concerns like process management.
WDYT @albertz?
I like this, the result has come out really cleanly separated.
@albertz see here for why I made a subclass instead. I did not like how the multiprocessing details bled into the existing, straightforward implementation and made it less straightforward: there were class fields that were only used in one of the two cases but not the other, the data iterator would dynamically be either one or the other variant, etc.
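That separation could be sketched roughly as follows. All names here are illustrative, not RETURNN's actual API, and a thread-backed pool stands in for the real worker processes:

```python
from multiprocessing.dummy import Pool  # thread-backed stand-in; the real variant uses OS processes


class PostprocessingDataset:
    """Sketch of the single-process base class (names are hypothetical)."""

    def __init__(self, dataset, map_seq):
        self._dataset = dataset  # wrapped dataset, here just any iterable
        self._map_seq = map_seq  # per-sequence postprocessing function

    def _build_mapping_iter(self):
        # The base implementation stays free of process-management concerns.
        return (self._map_seq(seq) for seq in self._dataset)


class MultiProcPostprocessingDataset(PostprocessingDataset):
    """Multi-worker variant: overrides only the iterator construction."""

    def __init__(self, dataset, map_seq, num_workers):
        super().__init__(dataset, map_seq)
        self._num_workers = num_workers

    def _build_mapping_iter(self):
        # Workers apply map_seq in parallel; imap preserves sequence order.
        return Pool(self._num_workers).imap(self._map_seq, self._dataset)
```

The base class never branches on a worker count, and the subclass carries all worker state; that is the kind of separation the comment above argues for.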
It seems like manually calling gc.collect() helps here. I guess from this finding it's sensible to add a GC parameter for this. Flamegraph: WIP
Even 2GB is already much more than I would expect. A Python interpreter with Numpy imported takes rss=36.6MB pss=36.1MB uss=36.0MB shared=596.0KB for me.
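Numbers like these (rss/pss/uss) typically come from a tool such as psutil's `memory_full_info()`. With the stdlib alone, a coarse peak-RSS check is possible via `resource` (Unix-only); a minimal sketch:

```python
import resource

# ru_maxrss is the peak resident set size of the current process.
# Note the unit difference: kilobytes on Linux, bytes on macOS.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak RSS: {peak} (KB on Linux, bytes on macOS)")
```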
Ah, I've been running with buf_size=200 so far, so some memory usage is expected. With buf_size=1 and frequent GC the memory usage is at about 400MB. |
Buffer size 200 seems way too much. I would expect something like 1-10 to be enough here?
This seems still a bit too much to me. Where exactly is it consumed? As said, I would expect more like 40MB.
Yeah, maybe. I tend to use larger buffers to make sure the data pipeline never stalls.
Hm, maybe. But I also don't think it's that problematic anymore then; maybe digging even deeper into where this memory is spent is "off topic" for this PR? The 400MB is the resident size, and as seen it is about 2x the heap size here.
I'm now looking into the allocation counts, they seem suspiciously high to me. In the 10 minutes of profiling time the process made about 930M allocations, many of which are made by the unpickler. Maybe we have an issue w/ unpickling? |
Force-pushed from 12a5186 to d2a9101
I'm not sure this is off topic for the PR. For that, we first need to understand where it comes from. From your flamegraph, I don't really get: where is the memory allocated, to what objects, where were they created, etc.?
Why would the unpickler need to make 380 million allocations over these 10 minutes if all we're sending is TensorDicts? Unfortunately I don't get more details on the precise objects that pickle is allocating; that's all I see for now.
Ok, 1200 allocations per seq doesn't seem like too much to me. Every single object counts here. Even some attribute access will allocate a new temporary method wrapper object.
https://stackoverflow.com/a/25334520/2000501 Oh, wow. One class instance (without slots) apparently costs at least ~250 bytes. I expected Python to use some memory, but did not expect it to be that wasteful. But ok.
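The per-instance cost is easy to verify with `sys.getsizeof`. A small sketch comparing a plain class (instance plus its `__dict__`) with a `__slots__` class; the class names are made up for illustration:

```python
import sys


class Plain:
    def __init__(self):
        self.x = 1
        self.y = 2


class Slotted:
    __slots__ = ("x", "y")

    def __init__(self):
        self.x = 1
        self.y = 2


p, s = Plain(), Slotted()
# A plain instance pays for the object header plus a per-instance __dict__.
plain_bytes = sys.getsizeof(p) + sys.getsizeof(p.__dict__)
# A slotted instance stores attributes in fixed slots, with no __dict__.
slotted_bytes = sys.getsizeof(s)
print(plain_bytes, slotted_bytes)
```

Exact numbers vary by CPython version, but the slotted instance is consistently the smaller of the two.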
I guess we're done here then. No further digging required. Maybe we can be more efficient at this in general by adding __slots__.
Wrt. the GC collection parameter, I am inclined to add a lower bound of e.g. 100 seqs to the value as well. The reason being that with buf_size=1 you end up in the pathological case where you gc.collect() after every seq, which is going to be a massive slowdown. I want to make this reasonably easy for the user, so that they don't have to think much about this config knob at all.
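The lower bound described above amounts to a simple clamp. A sketch with hypothetical names (`maybe_collect` and `min_interval` are illustrative, not the PR's actual config knob):

```python
import gc


def maybe_collect(seq_idx: int, buf_size: int, min_interval: int = 100) -> bool:
    """Run gc.collect() at most once every max(buf_size, min_interval) seqs.

    The lower bound protects against buf_size=1 degenerating into a
    collection after every single sequence.
    """
    interval = max(buf_size, min_interval)
    if seq_idx > 0 and seq_idx % interval == 0:
        gc.collect()
        return True
    return False
```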
Well, I still don't understand where it consumes 400MB. ~1200 allocations per seq doesn't really explain that, and most of these objects are temporary and immediately deallocated (no need for GC). Your log output still does not really show what I want to know. I don't really want to know about the accumulated amount of allocations. I want to know, at some particular point, what the allocated memory is: what objects, where were they allocated, etc. E.g. in the flamegraph, when you see it has 2GB (or 400MB) allocated, what is this 400MB?
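For the "what is allocated right now, and where was it allocated" question, the stdlib `tracemalloc` module gives exactly that: snapshots of live allocations grouped by allocation site. A minimal sketch:

```python
import tracemalloc

tracemalloc.start(10)  # keep up to 10 frames of traceback per allocation

# Stand-in for a workload holding live objects (~1MB here).
data = [bytearray(1000) for _ in range(1000)]

snapshot = tracemalloc.take_snapshot()
# Each stat shows bytes currently allocated, grouped by source line.
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)

total = sum(stat.size for stat in snapshot.statistics("lineno"))
print(f"total tracked: {total / 1e6:.1f} MB")
```

Grouping by `"traceback"` instead of `"lineno"` shows the full call chains, which would answer the "where were they created" question for the 400MB directly.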
Yes, this is true. I haven't been able to dig out this information from the profiles yet, but it's probably there. Will find that out.
Setup:

DistributeFilesDataset(
    buf_size=800,
    PostprocessingDataset(
        buf_size=1,
        num_workers=2,
        MetaDataset(HDFDataset(Audio), HDFDataset(Transcription))
    )
)

No explicit GC calls/tuning. By count it's mostly
But if you sum those allocations together, this is much less than 2GB? (Also less than 400MB?) Does it really count everything? Or is this now a different setup? Is this then still a relevant setup?
What memory are you measuring? I thought you measure only the subproc of
The difference is just in the timing of the measurements. I use
Ah, of course, yes. I am measuring only the postprocessor worker there. I mixed things up in my head when I wrote this. Still feeling a bit under the weather today.
Closes #1701
Currently fails due to #1763.