This PR implements two optimizations to the representation of Evidence objects that significantly decrease memory usage when manipulating large sets of INDRA Statements. The bulk of the memory used by INDRA Statements is attributable to the Evidence objects (incl. evidence text) attached to them. One approach to decreasing memory usage is to define the __slots__ attribute of Evidence so that the set of attributes it can have is pre-defined (rather than variable via a __dict__ attribute). This seemed to make only a minor difference in memory usage. Much larger memory savings can be achieved if the list of Evidences attached to a Statement is stored in a serialized, compressed form, and only decompressed and deserialized when it is accessed. Based on some experiments, a Statement with 100 pieces of Evidence uses 75% less memory with this PR. On some large assembled corpora that I tried, which contain Statements with varying numbers of Evidences, 80% lower memory usage is typical.
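To illustrate the two ideas, here is a minimal sketch rather than the code in this PR. `SlottedEvidence`, `pack_evidence` and `unpack_evidence` are hypothetical names, the three attributes are only an example subset, and the choice of JSON plus zlib for serialization and compression is an assumption made for the illustration.

```python
import json
import zlib


class SlottedEvidence:
    """Hypothetical stand-in for Evidence; not INDRA's actual class."""
    # With __slots__ the attribute set is fixed and no per-instance
    # __dict__ is allocated.
    __slots__ = ('source_api', 'pmid', 'text')

    def __init__(self, source_api=None, pmid=None, text=None):
        self.source_api = source_api
        self.pmid = pmid
        self.text = text

    def to_json(self):
        return {name: getattr(self, name) for name in self.__slots__}

    @classmethod
    def from_json(cls, data):
        return cls(**data)


def pack_evidence(evidence_list):
    """Serialize a list of evidences and compress the result."""
    raw = json.dumps([ev.to_json() for ev in evidence_list])
    return zlib.compress(raw.encode('utf-8'))


def unpack_evidence(blob):
    """Decompress and deserialize into a new list of evidences."""
    raw = zlib.decompress(blob).decode('utf-8')
    return [SlottedEvidence.from_json(d) for d in json.loads(raw)]


evs = [SlottedEvidence('reach', str(i), 'MEK phosphorylates ERK.')
       for i in range(100)]
raw_size = len(json.dumps([ev.to_json() for ev in evs]).encode('utf-8'))
blob = pack_evidence(evs)
restored = unpack_evidence(blob)
```

In this sketch the compressed blob is what would be kept in memory, and `unpack_evidence` would run only when the evidence is actually read. How much is saved depends on how repetitive the evidence text is.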
Little of this affects the way INDRA Statements are used, but there is one important difference: accessing a Statement's evidence (i.e., stmt.evidence) returns a view of the list of Evidences (deserialized on access) rather than a reference to the stored list. Directly manipulating stmt.evidence will therefore not result in persistent changes to the Statement. Rather, one has to do something like:
evs = stmt.evidence
for ev in evs:
    # Make some changes to each ev object
    ...
stmt.evidence = evs
to make changes to a Statement's list of Evidences. Some specialized code dealing with Evidence manipulation, as well as some tests, needed to be updated. I am still ambivalent about whether this change will cause confusion later, and am therefore not yet sure whether this PR should be merged.
Well, users of INDRA would never really notice any change; it is only during internal development (e.g., of pre-assembly algorithms or input processors) that one could make a mistake by attempting to change a view of a list of Evidences rather than the actual evidence attribute of a Statement. Saving into a variable is not really necessary; the key is to always set evidences as stmt.evidence = [...] so that the actual evidence list attribute is updated, rather than iterating over stmt.evidence and manipulating stmt.evidence[idx] directly, which with this change would only modify a view of the evidences. I agree it is somewhat confusing, hence my ambivalence about the change.
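The pitfall can be reproduced with a small sketch of the access pattern. This is an assumption about the general mechanism, not the PR's implementation: `SketchStatement` is a hypothetical class, evidences are plain dicts for brevity, and JSON plus zlib stand in for whatever serialization the PR actually uses.

```python
import json
import zlib


class SketchStatement:
    """Hypothetical stand-in for a Statement; not INDRA's implementation."""

    def __init__(self, evidence=None):
        # Goes through the setter below, so only the packed form is kept.
        self.evidence = evidence or []

    @property
    def evidence(self):
        # Every access decompresses and deserializes a brand new list.
        return json.loads(zlib.decompress(self._packed).decode('utf-8'))

    @evidence.setter
    def evidence(self, evidence_list):
        raw = json.dumps(evidence_list).encode('utf-8')
        self._packed = zlib.compress(raw)


stmt = SketchStatement([{'text': 'a'}, {'text': 'b'}])

# These edits touch temporary lists and are silently lost.
stmt.evidence[0]['text'] = 'changed'
stmt.evidence.append({'text': 'c'})
after_direct_edit = stmt.evidence

# Assigning back through the setter makes the change persistent.
evs = stmt.evidence
for ev in evs:
    ev['text'] = ev['text'].upper()
stmt.evidence = evs
after_assignment = stmt.evidence
```

In this sketch `after_direct_edit` still holds the original two evidences, while `after_assignment` reflects the upper-cased text, which is the behavior the description and the comment above warn about.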