
Understanding partial_fit results #37

@fabriceyhc

Description


Hi there, I'm trying to use apricot to help find a diverse set of texts. When I use the fit method, everything works intuitively. However, when I start using the partial_fit method, the outputs do not appear to be correct. I suspect that I'm misunderstanding something about how the library works. In case I'm not, I've prepared a small demo of the issue with explanations of what I got vs. what I expected.

# environment setup (run in a shell first): pip install textdiversity apricot-select --quiet
from textdiversity import POSSequenceDiversity
from apricot import FacilityLocationSelection

def chunker(seq, size):
    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

def test_apricot(featurizer, texts, fit_type="full_fit", batch_size=2):
    selector = FacilityLocationSelection(
        n_samples=len(texts),
        metric='euclidean',
        optimizer='lazy')
    if fit_type == "full_fit":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.fit(Z)
    elif fit_type == "unbatched_partial":
        f, c = featurizer.extract_features(texts)
        Z = featurizer.calculate_similarities(f)
        selector.partial_fit(Z)
    elif fit_type == "batched_partial":
        for batch in chunker(texts, batch_size):
            f, c = featurizer.extract_features(batch)
            Z = featurizer.calculate_similarities(f)
            selector.partial_fit(Z)
    print(f"{fit_type} ranking: {selector.ranking} | gain: {sum(selector.gains)}")

# test ====================================================

d = POSSequenceDiversity()

texts = ["This is a test.", 
         "This is also a test.", 
         "This is the real deal.", 
         "So is this one."]

test_apricot(d, texts, "full_fit") # > ranking: [0 3 1 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial") # > ranking: [2 3] | gain: 0.4444444444444444

texts = ["This is the real deal.",
         "So is this one.",
         "This is a test.", 
         "This is also a test."]

test_apricot(d, texts, "full_fit") # > ranking: [0 1 3 2] | gain: 2.8888888888888893
test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
test_apricot(d, texts, "batched_partial") # > ranking: [0 1] | gain: 0.5

- Full fit: makes intuitive sense. Texts with overlapping semantics get relegated to lower rankings, and so on.
- Unbatched partial: I expected unbatched partial_fit to behave the same as full fit, but no matter how I order the texts (reversed, or any other permutation), I always get [0 1 2 3]. Since partial_fit returns the same ranking regardless of the input order, this may indicate a bug, or I may be misunderstanding how it works. Please let me know.
- Batched partial: This one is responsive to changes in the order of the texts, but (a) it does not respect the n_samples parameter (I wanted to rank all the texts), and (b) its ranking does not agree with the full fit's, which I trust the most but unfortunately cannot use due to the size of my dataset.
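For context on what I expect: my understanding is that the full-fit ranking corresponds to greedily maximizing the facility-location objective over the complete similarity matrix. Here is a minimal NumPy sketch of that greedy loop (my own illustration of the objective, not apricot's actual implementation), which is the behavior I was hoping partial_fit would approximate across batches:

```python
import numpy as np

def greedy_facility_location(Z, n_samples):
    """Greedy ranking under the facility-location objective.

    Z is a full (n x n) similarity matrix. At each step, pick the item
    that most increases sum_j max_{i in selected} Z[i, j].
    """
    n = Z.shape[0]
    selected, gains = [], []
    current_max = np.zeros(n)  # best similarity to any selected item, per column
    for _ in range(n_samples):
        # Gain of each candidate: how much total coverage improves if added.
        improvements = np.maximum(Z, current_max).sum(axis=1) - current_max.sum()
        improvements[selected] = -np.inf  # never re-pick selected items
        best = int(np.argmax(improvements))
        selected.append(best)
        gains.append(float(improvements[best]))
        current_max = np.maximum(current_max, Z[best])
    return selected, gains
```

On a toy similarity matrix this reproduces the intuitive behavior of fit: the most broadly similar item is picked first, and near-duplicates of already-selected items contribute small gains and sink to the bottom of the ranking.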

Thanks for taking the time to read + potentially helping me out.
