Collaborator

@Icemole Icemole commented Jul 10, 2025

i6_core.lib.corpus.Corpus: reduced the runtime complexity of the lookup and removal functions:

  1. get_recording_by_name: O(#recordings) to O(1)
  2. get_segment_by_name: O(#recordings * #segments) to O(1)
  3. remove_recording: O(#recordings) to O(1)
  4. remove_recordings: O(#recordings) to O(#input_recordings_to_function)

i6_core.lib.corpus.Recording: there's now a get_segment_by_name function which runs in O(1) time complexity.
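
A minimal sketch of the idea, assuming the internal lists are replaced by name-keyed dictionaries. `ToyCorpus`, `ToyRecording` and `ToySegment` are hypothetical stand-ins for illustration, not the actual i6_core classes or signatures.

```python
from typing import Dict, Iterable


class ToySegment:
    def __init__(self, name: str):
        self.name = name


class ToyRecording:
    def __init__(self, name: str):
        self.name = name
        # Name -> segment; dicts preserve insertion order (Python 3.7+).
        self._segments: Dict[str, ToySegment] = {}

    def add_segment(self, segment: ToySegment) -> None:
        self._segments[segment.name] = segment

    def get_segment_by_name(self, name: str) -> ToySegment:
        # Average-case O(1) hash lookup instead of scanning a list.
        return self._segments[name]


class ToyCorpus:
    def __init__(self) -> None:
        self._recordings: Dict[str, ToyRecording] = {}

    def add_recording(self, recording: ToyRecording) -> None:
        self._recordings[recording.name] = recording

    def get_recording_by_name(self, name: str) -> ToyRecording:
        return self._recordings[name]

    def remove_recording(self, recording: ToyRecording) -> None:
        # Average-case O(1): delete by key, no list search needed.
        del self._recordings[recording.name]

    def remove_recordings(self, recordings: Iterable[ToyRecording]) -> None:
        # Cost grows with the number of recordings passed in,
        # not with the total size of the corpus.
        for recording in recordings:
            self.remove_recording(recording)


corpus = ToyCorpus()
rec = ToyRecording("rec1")
rec.add_segment(ToySegment("seg1"))
corpus.add_recording(rec)
found = corpus.get_recording_by_name("rec1").get_segment_by_name("seg1")
```

The O(1) figures are average-case bounds for hash-table lookups; the worst case for a Python dict is still linear.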

Caveat: there is a memory overhead with respect to the previous implementation, because the names have to be stored twice (once in the dictionary and once in the actual segment).

I did not test the current implementation, so I don't know the actual memory overhead. Something like 20-30 bytes (characters) per subcorpus + 40-50 bytes per recording + 10 bytes per segment sounds reasonable from my experience handling full names. For example, a big corpus (e.g. one with 100k recordings and 10M segments) would have an overhead of (50*100k + 10*10M) B = 105 MB (about 100 MiB).
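
The estimate can be reproduced with a few lines. The per-element byte counts are the rough guesses from this comment, not measurements, and the variable names are made up for the example. The sum comes to 105 * 10^6 bytes, which is 105 MB or roughly 100 MiB.

```python
# Rough per-element name sizes taken from the estimate above (assumptions).
BYTES_PER_RECORDING = 50
BYTES_PER_SEGMENT = 10

num_recordings = 100_000
num_segments = 10_000_000

overhead_bytes = (
    BYTES_PER_RECORDING * num_recordings + BYTES_PER_SEGMENT * num_segments
)
overhead_mb = overhead_bytes / 10**6   # decimal megabytes
overhead_mib = overhead_bytes / 2**20  # binary mebibytes

print(f"{overhead_bytes} B = {overhead_mb:.0f} MB = {overhead_mib:.1f} MiB")
```

The subcorpus term is left out because it is negligible next to the recording and segment terms.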

I would also be fine with making this a separate implementation. However, in general I think this is useful.

Now return the actual segments instead of the segment full names, following the previous commit
@Icemole Icemole requested review from JackTemaki and SimBe195 July 10, 2025 08:16
@Icemole Icemole changed the title Proposal: corpus/recoridng efficiency improvements Proposal: corpus/recording efficiency improvements Jul 10, 2025
Collaborator Author

Icemole commented Jul 14, 2025

I've noticed that a recording now has to be added to a corpus before segments are added to it, because recording.add_segment() now calls recording.fullname() (edit: and the recording name must already be set as well!). That removes some freedom that the user previously had. Would that be something you would frown upon?
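
A small sketch of why the order matters, under the assumption that segments are keyed by a full name derived from the parent chain. The `Toy*` classes and their behavior are hypothetical; the real `add_segment()` and `fullname()` may differ in detail.

```python
class ToyCorpus:
    def __init__(self, name):
        self.name = name
        self._recordings = {}

    def fullname(self):
        return self.name

    def add_recording(self, recording):
        recording.corpus = self
        self._recordings[recording.name] = recording


class ToyRecording:
    def __init__(self, name=None):
        self.name = name
        self.corpus = None
        self._segments = {}

    def fullname(self):
        if self.corpus is None or self.name is None:
            raise ValueError("recording needs a name and a parent corpus first")
        return f"{self.corpus.fullname()}/{self.name}"

    def add_segment(self, segment_name):
        # Keying by full name forces the parent chain to exist already.
        self._segments[f"{self.fullname()}/{segment_name}"] = segment_name


recording = ToyRecording("rec1")
try:
    recording.add_segment("seg1")  # no parent corpus yet
    failed_without_corpus = False
except ValueError:
    failed_without_corpus = True

corpus = ToyCorpus("corpus")
corpus.add_recording(recording)
recording.add_segment("seg1")  # works once the recording is attached
```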

@DanEnergetics

Could it be an issue that in the previous implementation with lists one could add duplicate recordings/segments/subcorpora? It's admittedly a rather pathological use case but I'm not aware of all possible uses of the corpus class.
Maybe it's enough, however, to just throw an error when the user tries to add a recording/subcorpus/segment with a duplicated name in the corresponding add_* methods just so the user knows what's going wrong in their setup all of a sudden.

Collaborator Author

Icemole commented Jul 15, 2025

> Could it be an issue that in the previous implementation with lists one could add duplicate recordings/segments/subcorpora? It's admittedly a rather pathological use case but I'm not aware of all possible uses of the corpus class. Maybe it's enough, however, to just throw an error when the user tries to add a recording/subcorpus/segment with a duplicated name in the corresponding add_* methods just so the user knows what's going wrong in their setup all of a sudden.

That's also a fair point. I will add an assertion that the name shouldn't exist. Thanks for the input!!
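
A sketch of such a check, using hypothetical `ToyCorpus` and `ToyRecording` classes rather than the real ones. It raises a `ValueError` instead of using a bare `assert`, since assertions are skipped when Python runs with `-O`; either variant expresses the same idea.

```python
class ToyRecording:
    def __init__(self, name):
        self.name = name


class ToyCorpus:
    def __init__(self):
        self._recordings = {}

    def add_recording(self, recording):
        # Refuse to silently overwrite an existing entry with the same name.
        if recording.name in self._recordings:
            raise ValueError(
                f"Recording {recording.name!r} already exists in this corpus"
            )
        self._recordings[recording.name] = recording


corpus = ToyCorpus()
corpus.add_recording(ToyRecording("rec1"))
try:
    corpus.add_recording(ToyRecording("rec1"))
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Without the check, a dictionary would keep only the last element added under a given name, whereas the old list-based storage kept both.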

Edit: I would also understand if this caused backlash, so I can also remove the functionality.

lib/corpus.py Outdated
e.subcorpora.extend(c.subcorpora)
e.recordings.extend(c.recordings)
e._subcorpora.update(c.subcorpora)
e._recordings.update(c.recordings)

I'm thinking that we could add a duplication check here, as well.
You could also add the methods add_subcorpora, add_recordings that have this check and add multiple subcorpora/recordings but I don't have a strong preference for that.

Collaborator Author

> add the methods add_subcorpora, add_recordings

As I see it, those would just loop over the parameters and add them individually (especially if we perform duplication checks; if not, we might use dict.update() as shown here), so I don't find them any different from the user writing a loop and adding the elements one by one.

I guess triggering e.add_subcorpus(sc) and e.add_recording(r) in the respective loop here would be just as good, and allow for duplicate checks. I initially thought there would be an issue (it didn't feel right), but after more careful thought I think it would be ok. What do you think?

> I guess triggering e.add_subcorpus(sc) and e.add_recording(r) in the respective loop here would be just as good

Yes that's totally fine by me
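
A sketch of the loop-based merge discussed in this thread, as opposed to a bulk `dict.update()`. `ToyCorpus`, `ToyRecording` and `merge_into` are hypothetical names; the point is only that routing every element through the `add_*` methods lets the duplicate check run for each one.

```python
class ToyRecording:
    def __init__(self, name):
        self.name = name


class ToyCorpus:
    def __init__(self, name):
        self.name = name
        self._subcorpora = {}
        self._recordings = {}

    def add_subcorpus(self, subcorpus):
        if subcorpus.name in self._subcorpora:
            raise ValueError(f"duplicate subcorpus name: {subcorpus.name!r}")
        self._subcorpora[subcorpus.name] = subcorpus

    def add_recording(self, recording):
        if recording.name in self._recordings:
            raise ValueError(f"duplicate recording name: {recording.name!r}")
        self._recordings[recording.name] = recording


def merge_into(e, c):
    """Copy the subcorpora and recordings of c into e, one element at a time."""
    for sc in c._subcorpora.values():
        e.add_subcorpus(sc)
    for r in c._recordings.values():
        e.add_recording(r)


e = ToyCorpus("e")
e.add_recording(ToyRecording("rec1"))
c = ToyCorpus("c")
c.add_recording(ToyRecording("rec2"))
c.add_subcorpus(ToyCorpus("sub"))
merge_into(e, c)

clash = ToyCorpus("clash")
clash.add_recording(ToyRecording("rec1"))
try:
    merge_into(e, clash)
    clash_detected = False
except ValueError:
    clash_detected = True
```

A plain `dict.update()` would have overwritten `rec1` without any warning.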

Collaborator Author

Icemole commented Jul 16, 2025

@DanEnergetics thanks for the review! It made me wonder: do we really need to index by the element's full name in the dictionary? We should be able to index by the plain name instead. Each element is indexed relative to its parent (since it only exists in the parent's internal dictionary), so indexing by full name is redundant.

I'll change this now.
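
A sketch of what indexing by the relative name could look like, with hypothetical `ToySegment` and `ToyRecording` classes. The full name is derived from the parent on demand rather than stored as a dictionary key.

```python
class ToySegment:
    def __init__(self, name):
        self.name = name
        self.recording = None

    def fullname(self):
        # Computed from the parent chain on demand, never used as a key.
        return f"{self.recording.fullname()}/{self.name}"


class ToyRecording:
    def __init__(self, name, corpus_name):
        self.name = name
        self.corpus_name = corpus_name
        self._segments = {}

    def fullname(self):
        return f"{self.corpus_name}/{self.name}"

    def add_segment(self, segment):
        segment.recording = self
        # The key is only the name relative to this recording.
        self._segments[segment.name] = segment

    def pop_segment(self, name):
        segment = self._segments.pop(name)
        segment.recording = None
        return segment


rec_a = ToyRecording("rec_a", "corpus")
rec_b = ToyRecording("rec_b", "corpus")
rec_a.add_segment(ToySegment("seg1"))
name_before = rec_a._segments["seg1"].fullname()

# Moving the segment needs no key rewriting; the full name follows the parent.
rec_b.add_segment(rec_a.pop_segment("seg1"))
name_after = rec_b._segments["seg1"].fullname()
```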

Icemole and others added 4 commits July 16, 2025 10:57
Allow searching for base name in recording as well, and search in subcorpora when segment not found in main corpus
@DanEnergetics DanEnergetics left a comment

I think your idea of forgoing the full segment names in favor of only the "relative" ones is a good one. It also seems to make the elements more reusable, e.g. when moving a segment to another recording/corpus.

Also add parent_corpus parameter to corpus init, add else clauses if related object is not provided
Better code on XML parser, docstring improvements, fixes on get_*_by_name through testing
Collaborator Author

Icemole commented Aug 12, 2025

I've tested the functionality, and it works wonders! It's only a bit confusing to index into subcorpora when they are nested without a natural pattern (e.g. when the first subcorpus is named "corpus", get_recording_by_name potentially requires the subcorpus name), but I guess the user should know about it.

Maybe if there's a single subcorpus (many use cases have this feature, from my experience), we could go straight to the subcorpus if the recording/segment is not found at the current level.
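
A sketch of that fallback, assuming a hypothetical `ToyCorpus` with name-keyed dictionaries; plain strings stand in for recording objects. The lookup descends automatically only when there is exactly one subcorpus, so it stays unambiguous.

```python
class ToyCorpus:
    def __init__(self, name):
        self.name = name
        self._subcorpora = {}
        self._recordings = {}  # name -> recording (strings here for brevity)

    def get_recording_by_name(self, name):
        if name in self._recordings:
            return self._recordings[name]
        # Proposed fallback: with exactly one subcorpus, look there instead.
        if len(self._subcorpora) == 1:
            (only_subcorpus,) = self._subcorpora.values()
            return only_subcorpus.get_recording_by_name(name)
        raise KeyError(name)


inner = ToyCorpus("corpus")
inner._recordings["rec1"] = "recording object for rec1"
root = ToyCorpus("root")
root._subcorpora[inner.name] = inner
result = root.get_recording_by_name("rec1")

# With two subcorpora the fallback does not apply and the lookup fails.
root._subcorpora["other"] = ToyCorpus("other")
try:
    root.get_recording_by_name("rec1")
    ambiguous_lookup_failed = False
except KeyError:
    ambiguous_lookup_failed = True
```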

Collaborator Author

Icemole commented Sep 24, 2025

Hi all, is there anything else that you would like to see in this PR?

5 participants