-
Notifications
You must be signed in to change notification settings - Fork 24
Proposal: corpus/recording efficiency improvements #615
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Now return the actual segments instead of the segment full names, following the previous commit
Co-authored-by: Albert Zeyer <[email protected]>
|
I've noticed that we should add the recording to a corpus before adding the segment to the recording, because now |
|
Could it be an issue that in the previous implementation with lists one could add duplicate recordings/segments/subcorpora? It's admittedly a rather pathological use case but I'm not aware of all possible uses of the corpus class. |
That's also a fair point. I will add an assertion that the name shouldn't exist. Thanks for the input!! Edit: I would also understand that this caused backlash, so I can also remove the functionality. |
lib/corpus.py
Outdated
| e.subcorpora.extend(c.subcorpora) | ||
| e.recordings.extend(c.recordings) | ||
| e._subcorpora.update(c.subcorpora) | ||
| e._recordings.update(c.recordings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking that we could add a duplication check here, as well.
You could also add the methods add_subcorpora, add_recordings that have this check and add multiple subcorpora/recordings but I don't have a strong preference for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the methods add_subcorpora, add_recordings
As I see it, those would just loop over the parameters (especially if we perform duplication checks, if not we might use dict.update() as shown here) and add them individually, so I don't find these different from the user performing a loop and adding the elements individually.
I guess triggering e.add_subcorpus(sc) and e.add_recording(r) in the respective loop here would be just as good, and allow for duplicate checks. I initially thought there would be an issue (it didn't feel right), but after more careful thought I think it would be ok. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess triggering e.add_subcorpus(sc) and e.add_recording(r) in the respective loop here would be just as good
Yes that's totally fine by me
|
@DanEnergetics thanks for the review! It made me wonder, do we really need to index the element's full name in the dictionary? We should be able to index the I'll change this now. |
Co-authored-by: DanEnergetics <[email protected]>
Allow searching for base name in recording as well, and search in subcorpora when segment not found in main corpus
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your idea is good to forego the full segment names for only the "relative" ones. This seems to make it more reusable as well, e.g. moving a segment to another recording/corpus.
lib/corpus.py
Outdated
| e.subcorpora.extend(c.subcorpora) | ||
| e.recordings.extend(c.recordings) | ||
| e._subcorpora.update(c.subcorpora) | ||
| e._recordings.update(c.recordings) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess triggering e.add_subcorpus(sc) and e.add_recording(r) in the respective loop here would be just as good
Yes that's totally fine by me
Also add parent_corpus parameter to corpus init, add else clauses if related object is not provided
|
I've tested the functionality, and it works wonders! It's only a bit confusing to index subcorpora when it's nested without a natural pattern (e.g. first subcorpus named Maybe if there's a single subcorpus (many use cases have this feature, from my experience), we could go straight to the subcorpus if the recording/segment is not found at the current level. |
|
Hi all, is there anything else that you would like to see in this PR? |
i6_core.lib.corpus.Corpus: runtime complexity reduced of the lookup functions:get_recording_by_name:O(#recordings)toO(1)get_segment_by_name:O(#recordings * #segments)toO(1)remove_recording:O(#recordings)toO(1)remove_recordings:O(#recordings)toO(#input_recordings_to_function)i6_core.lib.corpus.Recording: there's now aget_segment_by_namefunction which runs inO(1)time complexity.Caveat: memory overhead with respect to the previous implementation from having to store the names twice (once in the dictionary and another in the actual segment).
I did not test the current implementation so I don't know the actual memory overhead. Something like 20-30 bytes (characters) per subcorpus + 40-50 bytes per recordings + 10 bytes per segment sounds reasonable from my experience handling full names. For example, a big corpora (e.g. a corpus with 100k recordings and 10M segments) would have an overhead of (50*100k + 10*10M)B = 105MiB).
I would also be fine by making this a separate implementation. However, in general I think this is useful.