
Understanding partial_fit results #37

@fabriceyhc

Description


Hi there, I'm trying to use apricot to help find a diverse set of texts. When I use the fit method, everything works intuitively. However, when I start using the partial_fit method, the outputs do not appear to be correct. I suspect that I'm misunderstanding something about how the library works. In case I'm not, I've prepared a small demo of the issue with explanations of what I got vs. what I expected.

# environment setup (run in a shell first): pip install textdiversity apricot-select --quiet
from textdiversity import POSSequenceDiversity
from apricot import FacilityLocationSelection

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def test_apricot(featurizer, texts, fit_type="full_fit", batch_size=2):
    selector = FacilityLocationSelection(
        n_samples=len(texts),
        metric='euclidean',
        optimizer='lazy')
    if fit_type == "full_fit":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.fit(Z)
    elif fit_type == "unbatched_partial":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.partial_fit(Z)
    elif fit_type == "batched_partial":
        for batch in chunker(texts, batch_size):
            f, c = featurizer.extract_features(batch)
            Z = featurizer.calculate_similarities(f)
            selector.partial_fit(Z)
    print(f"{fit_type} ranking: {selector.ranking} | gain: {sum(selector.gains)}")

# test ====================================================

d = POSSequenceDiversity()

texts = ["This is a test.", 
         "This is also a test.", 
         "This is the real deal.", 
         "So is this one."]

test_apricot(d, texts, "full_fit") # > ranking: [0 3 1 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial") # > ranking: [2 3] | gain: 0.4444444444444444

texts = ["This is the real deal.",
         "So is this one.",
         "This is a test.", 
         "This is also a test."]

test_apricot(d, texts, "full_fit") # > ranking: [0 1 3 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial") # > ranking: [0 1] | gain: 0.5

- Full fit: makes intuitive sense. Texts with overlapping semantics get relegated to lower rankings, and so on.
- Unbatched partial: I expected unbatched partial_fit to behave the same as full fit, but no matter how I order the texts (reversed, or any other permutation), I always get [0 1 2 3]. Since partial_fit returns the same ranking regardless of the input order, this may indicate a bug, or I may be misunderstanding how it works. Please let me know.
- Batched partial: This one is responsive to changes in the order of the texts, but (a) it does not respect the n_samples parameter (I wanted to rank all the texts), and (b) its ranking does not agree with the full fit's, which I trust the most but unfortunately cannot use due to the size of my dataset.
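For context on what I expect: my understanding is that the full-fit ranking corresponds to greedily maximizing the facility-location objective over the complete similarity matrix. Here is a minimal NumPy sketch of that greedy loop (my own illustration of the objective, not apricot's actual implementation), which is the behavior I was hoping partial_fit would approximate across batches:

```python
import numpy as np

def greedy_facility_location(Z, n_samples):
    """Greedy ranking under the facility-location objective.

    Z is a full (n x n) similarity matrix. At each step, pick the item
    that most increases sum_j max_{i in selected} Z[i, j].
    """
    n = Z.shape[0]
    selected, gains = [], []
    current_max = np.zeros(n)  # best similarity to any selected item, per column
    for _ in range(n_samples):
        # Gain of each candidate: how much total coverage improves if added.
        improvements = np.maximum(Z, current_max).sum(axis=1) - current_max.sum()
        improvements[selected] = -np.inf  # never re-pick selected items
        best = int(np.argmax(improvements))
        selected.append(best)
        gains.append(float(improvements[best]))
        current_max = np.maximum(current_max, Z[best])
    return selected, gains
```

On a toy similarity matrix this reproduces the intuitive behavior of fit: the most broadly similar item is picked first, and near-duplicates of already-selected items contribute small gains and sink to the bottom of the ranking.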

Thanks for taking the time to read + potentially helping me out.
