Incorrect join on large tables for add_tfidf #50

@jstammers

Description

I've found that the current implementation of add_tfidf does not correctly join the term frequencies back onto their rows for large tables.
Here's an example using faker that illustrates the problem:

```python
from mismo.sets import add_tfidf
import ibis
from ibis import _
from faker import Faker

ibis.options.interactive = True

f = Faker()
Faker.seed(1234)

addresses = [f.address() for _ in range(2_000_000)]

table = ibis.memtable({"address": addresses}).mutate(tokens=_.address.split(" "))
table = add_tfidf(table, "tokens")

table
```
```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ address                     ┃ tokens                            ┃ tokens_tfidf                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                      │ array<string>                     │ map<string, float64>                                                    │
├─────────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤
│ USS Hernandez\nFPO AP 03359 │ ['USS', 'Hernandez\nFPO', ... +2] │ {'USS': 2.357353036528199, 'AP': 1.6684075613664435, ... +2}            │
│ USNV Jenkins\nFPO AA 17631  │ ['USNV', 'Jenkins\nFPO', ... +2]  │ {'USS': 2.357353036528199, 'AP': 1.6684075613664435, ... +2}            │
│ USS Stanton\nFPO AP 43458   │ ['USS', 'Stanton\nFPO', ... +2]   │ {'AP': 1.6684075613664435, 'USNS': 2.3627657713296375, ... +2}          │
│ USNS Brown\nFPO AP 97018    │ ['USNS', 'Brown\nFPO', ... +2]    │ {'Olson\nFPO': 4.9226093222060765, 'USNV': 2.3593934563875294, ... +2}  │
│ USS Meza\nFPO AE 53363      │ ['USS', 'Meza\nFPO', ... +2]      │ {'13102': 5.521460917862246, 'Espinoza\nFPO': 5.21556014730925, ... +2} │
│ USNS Espinoza\nFPO AP 13102 │ ['USNS', 'Espinoza\nFPO', ... +2] │ {'USS': 2.357353036528199, 'Baker\nFPO': 4.511930402516782, ... +2}     │
│ USS Baker\nFPO AP 67296     │ ['USS', 'Baker\nFPO', ... +2]     │ {'USNV': 2.3593934563875294, '68208': 5.50607508852887, ... +2}         │
│ USNV Sanchez\nFPO AE 68208  │ ['USNV', 'Sanchez\nFPO', ... +2]  │ {'USNV': 2.3593934563875294, 'AA': 1.6625767137942553, ... +2}          │
│ USS Jackson\nFPO AA 20151   │ ['USS', 'Jackson\nFPO', ... +2]   │ {'USS': 2.357353036528199, '49182': 5.665301954088137, ... +2}          │
│ USS Harper\nFPO AP 49182    │ ['USS', 'Harper\nFPO', ... +2]    │ {'USNV': 2.3593934563875294, 'AA': 1.6625767137942553, ... +2}          │
│ …                           │ …                                 │ …                                                                       │
└─────────────────────────────┴───────────────────────────────────┴─────────────────────────────────────────────────────────────────────────┘
```

Notice that many rows (e.g. `USNV Jenkins\nFPO AA 17631`) end up with a `tokens_tfidf` map whose keys don't appear in that row's `tokens` at all.
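One way to quantify the mis-join (a hypothetical check written for illustration, not part of mismo) is to pull the result into pandas and compare each row's tokens against the keys of its tfidf map:

```python
import pandas as pd

def misjoined_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose tokens_tfidf keys don't match the row's own tokens.

    A correct add_tfidf should produce a map keyed by exactly the distinct
    tokens of each row, so any mismatch indicates the term frequencies were
    joined onto the wrong row.
    """
    mask = [set(m) != set(t) for m, t in zip(df["tokens_tfidf"], df["tokens"])]
    return df[mask]
```

Running `misjoined_rows(table.execute())` should return an empty frame when the join is correct.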

I've been able to resolve this myself by caching the result of this line:

```python
with_counts = with_counts.mutate(__row_number=ibis.row_number())
```

My gut feeling is that it's related to the lazy evaluation of `ibis.row_number()`, which isn't preserved when joining `term_counts` with `idf`. This feels more like an upstream issue to me, as I would expect the row numbers to be consistent even if the intermediate table isn't cached.
