Proposal: corpus/recording efficiency improvements #615

Icemole · 2025-07-10T07:46:53Z

i6_core.lib.corpus.Corpus: reduced runtime complexity of the lookup functions:

get_recording_by_name: O(#recordings) to O(1)
get_segment_by_name: O(#recordings * #segments) to O(1)
remove_recording: O(#recordings) to O(1)
remove_recordings: O(#recordings) to O(#input_recordings_to_function)

i6_core.lib.corpus.Recording: there's now a get_segment_by_name function which runs in O(1) time complexity.
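As a rough sketch of the idea behind these numbers (not the actual code in this PR: `ListCorpus` and `DictCorpus` are hypothetical stand-ins, and recordings are reduced to plain payloads), a list-backed lookup has to scan all entries, while a dict keyed by name is a single hash lookup:

```python
class ListCorpus:
    """Hypothetical stand-in for the previous list-backed storage."""

    def __init__(self):
        self.recordings = []  # list of (name, payload) tuples

    def add_recording(self, name, payload):
        self.recordings.append((name, payload))

    def get_recording_by_name(self, name):
        # Linear scan: O(#recordings) per lookup.
        for rec_name, payload in self.recordings:
            if rec_name == name:
                return payload
        raise KeyError(name)


class DictCorpus:
    """Hypothetical stand-in for dict-backed storage."""

    def __init__(self):
        self._recordings = {}  # name -> payload

    def add_recording(self, name, payload):
        self._recordings[name] = payload

    def get_recording_by_name(self, name):
        # Hash lookup: O(1) on average.
        return self._recordings[name]

    def remove_recordings(self, names):
        # One deletion per requested name, independent of the corpus size.
        for name in names:
            del self._recordings[name]


old, new = ListCorpus(), DictCorpus()
for i in range(1000):
    old.add_recording(f"rec{i}", i)
    new.add_recording(f"rec{i}", i)

same_result = old.get_recording_by_name("rec999") == new.get_recording_by_name("rec999")
new.remove_recordings(["rec0", "rec1"])
remaining = len(new._recordings)
```

Both variants return the same element; only the cost per call differs.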

Caveat: there is a memory overhead compared to the previous implementation, because the names now have to be stored twice (once as the dictionary key and once in the actual element, e.g. the segment).

I did not test the current implementation, so I don't know the actual memory overhead. Something like 20-30 bytes (characters) per subcorpus + 40-50 bytes per recording + 10 bytes per segment sounds reasonable from my experience handling full names. For example, a big corpus (e.g. one with 100k recordings and 10M segments) would have an overhead of (50 * 100k + 10 * 10M) B = 105 MB (about 100 MiB).
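The estimate can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the constants are the assumed upper-end figures from the paragraph above, the per-subcorpus term is ignored as negligible, and Python's own per-object overhead is not counted. Note that 105,000,000 bytes is 105 MB in decimal units, which is about 100 MiB.

```python
# Assumed per-element overheads in bytes (upper-end guesses, not measurements).
BYTES_PER_RECORDING = 50
BYTES_PER_SEGMENT = 10

num_recordings = 100_000
num_segments = 10_000_000

overhead_bytes = BYTES_PER_RECORDING * num_recordings + BYTES_PER_SEGMENT * num_segments
overhead_mb = overhead_bytes / 10**6   # decimal megabytes
overhead_mib = overhead_bytes / 2**20  # binary mebibytes

print(f"{overhead_bytes} B = {overhead_mb:.0f} MB = {overhead_mib:.1f} MiB")
# prints: 105000000 B = 105 MB = 100.1 MiB
```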

I would also be fine with making this a separate implementation. However, in general I think this is useful.

Icemole · 2025-07-14T10:35:01Z

I've noticed that we now have to add the recording to a corpus before adding a segment to the recording, because recording.add_segment() now calls recording.fullname() (edit: and the recording name has to be set before that as well!). That removes some freedom that the user previously had. Would that be something you would frown upon?

DanEnergetics · 2025-07-15T10:06:57Z

Could it be an issue that in the previous implementation with lists one could add duplicate recordings/segments/subcorpora? It's admittedly a rather pathological use case but I'm not aware of all possible uses of the corpus class.
Maybe it's enough, however, to just throw an error when the user tries to add a recording/subcorpus/segment with a duplicated name in the corresponding add_* methods just so the user knows what's going wrong in their setup all of a sudden.

Icemole · 2025-07-15T14:49:12Z

> Could it be an issue that in the previous implementation with lists one could add duplicate recordings/segments/subcorpora? It's admittedly a rather pathological use case but I'm not aware of all possible uses of the corpus class. Maybe it's enough, however, to just throw an error when the user tries to add a recording/subcorpus/segment with a duplicated name in the corresponding add_* methods just so the user knows what's going wrong in their setup all of a sudden.

That's also a fair point. I will add an assertion that the name shouldn't exist. Thanks for the input!!

Edit: I would also understand if this caused backlash, so I can also remove the functionality.
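A minimal sketch of what such a check could look like, assuming a dict keyed by the element name. The `Corpus` and `Recording` classes below are simplified, hypothetical stand-ins and not the real i6_core classes:

```python
class Recording:
    def __init__(self, name):
        self.name = name


class Corpus:
    def __init__(self):
        self._recordings = {}  # name -> Recording

    def add_recording(self, recording):
        # A list would silently accept a duplicate; a plain dict assignment
        # would silently overwrite the earlier entry. Fail loudly instead.
        assert recording.name not in self._recordings, (
            f"recording {recording.name!r} already exists in this corpus"
        )
        self._recordings[recording.name] = recording


corpus = Corpus()
corpus.add_recording(Recording("rec1"))
try:
    corpus.add_recording(Recording("rec1"))
    duplicate_rejected = False
except AssertionError:
    duplicate_rejected = True
```

One thing to keep in mind: `assert` statements are skipped when Python runs with `-O`, so raising an explicit exception would be the stricter alternative.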

DanEnergetics · 2025-07-15T16:24:57Z

lib/corpus.py

-            e.subcorpora.extend(c.subcorpora)
-            e.recordings.extend(c.recordings)
+            e._subcorpora.update(c.subcorpora)
+            e._recordings.update(c.recordings)


I'm thinking that we could add a duplication check here, as well.
You could also add the methods add_subcorpora, add_recordings that have this check and add multiple subcorpora/recordings but I don't have a strong preference for that.

Icemole · 2025-07-16

> add the methods add_subcorpora, add_recordings

As I see it, those would just loop over the parameters (especially if we perform duplication checks; if not, we might use dict.update() as shown here) and add them individually, so I don't find these different from the user performing a loop and adding the elements individually.

I guess triggering e.add_subcorpus(sc) and e.add_recording(r) in the respective loop here would be just as good, and allow for duplicate checks. I initially thought there would be an issue (it didn't feel right), but after more careful thought I think it would be ok. What do you think?

DanEnergetics · 2025-07-18

> I guess triggering e.add_subcorpus(sc) and e.add_recording(r) in the respective loop here would be just as good

Yes, that's totally fine by me.
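The agreed approach could look roughly like the following sketch. The classes are simplified, hypothetical stand-ins, the function name `include_corpus` is an assumption made for illustration, and `e` and `c` mirror the variable names in the diff above:

```python
class Recording:
    def __init__(self, name):
        self.name = name


class Corpus:
    def __init__(self, name):
        self.name = name
        self._subcorpora = {}  # name -> Corpus
        self._recordings = {}  # name -> Recording

    def add_subcorpus(self, subcorpus):
        assert subcorpus.name not in self._subcorpora, f"duplicate subcorpus {subcorpus.name!r}"
        self._subcorpora[subcorpus.name] = subcorpus

    def add_recording(self, recording):
        assert recording.name not in self._recordings, f"duplicate recording {recording.name!r}"
        self._recordings[recording.name] = recording


def include_corpus(e, c):
    """Merge the contents of corpus c into corpus e, one element at a time."""
    # Unlike dict.update(), going through add_* triggers the duplicate checks.
    for sc in c._subcorpora.values():
        e.add_subcorpus(sc)
    for r in c._recordings.values():
        e.add_recording(r)


e = Corpus("main")
e.add_recording(Recording("rec1"))

c = Corpus("included")
c.add_recording(Recording("rec2"))
c.add_subcorpus(Corpus("sub"))
include_corpus(e, c)
merged_names = sorted(e._recordings)

clash = Corpus("clash")
clash.add_recording(Recording("rec1"))
try:
    include_corpus(e, clash)
    clash_detected = False
except AssertionError:
    clash_detected = True
```

With `dict.update()` the second `rec1` would have replaced the first one without any warning; the loop surfaces the clash instead.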

Icemole · 2025-07-16T08:56:50Z

@DanEnergetics thanks for the review! It made me wonder: do we really need to index the elements by their full name in the dictionary? We should be able to index them by their name instead. Each element is indexed relative to its parent element (since it only exists in the parent's internal dictionary), so indexing by full name is redundant.

I'll change this now.
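To illustrate what I mean, here is a minimal sketch with hypothetical, stripped-down classes. It assumes that a full name has the form "<corpus>/<recording>/<segment>" and that the individual names contain no "/"; neither assumption is checked against the real implementation:

```python
class Segment:
    def __init__(self, name):
        self.name = name


class Recording:
    def __init__(self, name):
        self.name = name
        self._segments = {}  # relative segment name -> Segment

    def add_segment(self, segment):
        # Only the relative name is needed, so no fullname() call here.
        self._segments[segment.name] = segment


class Corpus:
    def __init__(self, name):
        self.name = name
        self._recordings = {}  # relative recording name -> Recording

    def add_recording(self, recording):
        self._recordings[recording.name] = recording

    def get_segment_by_name(self, full_name):
        # Assumed layout: "<corpus>/<recording>/<segment>".
        corpus_name, recording_name, segment_name = full_name.split("/")
        assert corpus_name == self.name
        # Two O(1) dict lookups, one per nesting level.
        return self._recordings[recording_name]._segments[segment_name]


# In this sketch the keys are relative, so a segment can be added before the
# recording is attached to a corpus.
rec = Recording("rec1")
rec.add_segment(Segment("seg1"))
corpus = Corpus("corpus")
corpus.add_recording(rec)
found = corpus.get_segment_by_name("corpus/rec1/seg1")
```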

DanEnergetics · 2025-07-18T15:45:38Z

I think your idea of forgoing the full segment names in favor of only the "relative" ones is good. This seems to make it more reusable as well, e.g. when moving a segment to another recording/corpus.

Icemole · 2025-08-12T09:31:31Z

I've tested the functionality, and it works wonders! The only slightly confusing part is indexing into subcorpora when the nesting doesn't follow a natural pattern (e.g. if the first subcorpus is named "corpus", get_recording_by_name potentially requires the subcorpus name), but I guess the user should know about it.

Maybe if there's a single subcorpus (many use cases have this feature, from my experience), we could go straight to the subcorpus if the recording/segment is not found at the current level.
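A minimal sketch of that fallback, again with a hypothetical, stripped-down `Corpus` in which recordings are plain strings. This is only one possible reading of the idea and is not implemented in the PR:

```python
class Corpus:
    """Hypothetical, stripped-down corpus; recordings are plain strings here."""

    def __init__(self, name):
        self.name = name
        self._subcorpora = {}  # name -> Corpus
        self._recordings = {}  # name -> recording payload

    def get_recording_by_name(self, name):
        if name in self._recordings:
            return self._recordings[name]
        # Proposed fallback: descend only if exactly one subcorpus exists,
        # so the choice of where to look is never ambiguous.
        if len(self._subcorpora) == 1:
            (only_subcorpus,) = self._subcorpora.values()
            return only_subcorpus.get_recording_by_name(name)
        raise KeyError(name)


root = Corpus("root")
inner = Corpus("corpus")
inner._recordings["rec1"] = "recording-1"
root._subcorpora[inner.name] = inner

# No subcorpus name needed: the lookup falls through to the single subcorpus.
found = root.get_recording_by_name("rec1")
```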

Icemole · 2025-09-24T06:05:01Z

Hi all, is there anything else that you would like to see in this PR?

Icemole added 2 commits July 10, 2025 03:24

Convert corpus structure to dict

81cec26

Convert recording structure to dict

3ba9608

Icemole requested review from DanEnergetics, NeoLegends, albertz, curufinwe, hannah220, kuacakuaca, michelwi, moothiringote and sarahberanek July 10, 2025 07:46

Icemole self-assigned this Jul 10, 2025

Fix Corpus.segments() call

b066017

Now return the actual segments instead of the segment full names, following the previous commit

albertz reviewed Jul 10, 2025

albertz reviewed Jul 10, 2025

Icemole requested review from JackTemaki and SimBe195 July 10, 2025 08:16

Use rsplit instead of splitting and concatenating back

16c7271

Co-authored-by: Albert Zeyer

Icemole changed the title ~~Proposal: corpus/recoridng efficiency improvements~~ Proposal: corpus/recording efficiency improvements Jul 10, 2025

Add recording at the beginning

28ce889

Icemole added 3 commits July 14, 2025 06:37

Fix

14247a5

Add name to NamedEntity/Corpus/Recording/Segment init

4cd832a

Use newly declared parameters in init

dc8b4e4

DanEnergetics reviewed Jul 14, 2025

Icemole and others added 5 commits July 15, 2025 11:07

Directly copy self segments

9d23872

Co-authored-by: DanEnergetics

Better init

a673f83

Corpus: add subcorpora, recordings properties as read only

1ebeb7d

Update filter segments function

1e678e8

Corpus: add subcorpora, recordings as properties (2)

de59bec

Icemole added 10 commits July 15, 2025 06:32

Set explicit read only properties

d5fa7a9

Improve docstring

151966c

Add remove_segment call

b406ac7

Fix recording segments call

4d356df

Fix Recording.segments alls throughout the repo

42e69d2

Add proper setters

f1f0d73

Take advantage of setter

f671320

Fix recording call

8472e3e

More fixes

cb4856f

Update include corpus

5ac2ec1

Icemole added 2 commits July 15, 2025 10:54

Add assertions that element must not exist in internal structure when adding it

e388a22

Add docstring

e0f9473

DanEnergetics reviewed Jul 15, 2025

Icemole and others added 4 commits July 16, 2025 10:57

Apply suggestions from code review

20ad0ad

Co-authored-by: DanEnergetics

Use name instead of full name

a12a430

Remove redundant conversion to list

c03ec82

Improve retrieval of segments from corpus/recording

9278b1e

Allow searching for base name in recording as well, and search in subcorpora when segment not found in main corpus

DanEnergetics reviewed Jul 18, 2025

Icemole added 7 commits August 11, 2025 05:56

Use Corpus API

fdc7315

Add attributes/types to base class

be48a26

Various improvements to user class init

f7b42f8

Also add parent_corpus parameter to corpus init, add else clauses if related object is not provided

Remove unneeded assertion

a160dc6

Add comma

6a955e3

Improve docstring

eda8ab2

Work

a22e8bf

Better code on XML parser, docstring improvements, fixes on get_*_by_name through testing
