I've found that the current implementation of add_tfidf does not correctly join the term frequencies back onto their rows for large tables: the tokens_tfidf map on a row can contain tokens that belong to other rows.
Here's an example using faker that illustrates the problem:
from mismo.sets import add_tfidf
import ibis
from ibis import _
from faker import Faker

ibis.options.interactive = True
f = Faker()
Faker.seed(1234)
# 2M fake addresses; the loop variable is named i so it is not confused with ibis's deferred `_`
addresses = [f.address() for i in range(2_000_000)]
table = ibis.memtable({"address": addresses}).mutate(tokens=_.address.split(" "))
table = add_tfidf(table, "tokens")
# From the second row on, the tokens_tfidf keys no longer match the row's own tokens
table
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ address ┃ tokens ┃ tokens_tfidf ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string │ array<string> │ map<string, float64> │
├─────────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤
│ USS Hernandez\nFPO AP 03359 │ ['USS', 'Hernandez\nFPO', ... +2] │ {'USS': 2.357353036528199, 'AP': 1.6684075613664435, ... +2} │
│ USNV Jenkins\nFPO AA 17631 │ ['USNV', 'Jenkins\nFPO', ... +2] │ {'USS': 2.357353036528199, 'AP': 1.6684075613664435, ... +2} │
│ USS Stanton\nFPO AP 43458 │ ['USS', 'Stanton\nFPO', ... +2] │ {'AP': 1.6684075613664435, 'USNS': 2.3627657713296375, ... +2} │
│ USNS Brown\nFPO AP 97018 │ ['USNS', 'Brown\nFPO', ... +2] │ {'Olson\nFPO': 4.9226093222060765, 'USNV': 2.3593934563875294, ... +2} │
│ USS Meza\nFPO AE 53363 │ ['USS', 'Meza\nFPO', ... +2] │ {'13102': 5.521460917862246, 'Espinoza\nFPO': 5.21556014730925, ... +2} │
│ USNS Espinoza\nFPO AP 13102 │ ['USNS', 'Espinoza\nFPO', ... +2] │ {'USS': 2.357353036528199, 'Baker\nFPO': 4.511930402516782, ... +2} │
│ USS Baker\nFPO AP 67296 │ ['USS', 'Baker\nFPO', ... +2] │ {'USNV': 2.3593934563875294, '68208': 5.50607508852887, ... +2} │
│ USNV Sanchez\nFPO AE 68208 │ ['USNV', 'Sanchez\nFPO', ... +2] │ {'USNV': 2.3593934563875294, 'AA': 1.6625767137942553, ... +2} │
│ USS Jackson\nFPO AA 20151 │ ['USS', 'Jackson\nFPO', ... +2] │ {'USS': 2.357353036528199, '49182': 5.665301954088137, ... +2} │
│ USS Harper\nFPO AP 49182 │ ['USS', 'Harper\nFPO', ... +2] │ {'USNV': 2.3593934563875294, 'AA': 1.6625767137942553, ... +2} │
│ … │ … │ … │
└─────────────────────────────┴───────────────────────────────────┴─────────────────────────────────────────────────────────────────────────┘
I've been able to resolve this myself by caching the result of this line (line 252 in fc65234):
with_counts = with_counts.mutate(__row_number=ibis.row_number())
My gut feeling is that it's related to the lazy evaluation of ibis.row_number(): the row numbers don't seem to be preserved when term_counts is joined with idf. This feels more like an upstream issue to me, as I would expect row_number to be consistent even if the intermediate table isn't cached.
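A quick way to check whether a given result is affected, with or without the cache, is to compare each row's tokens_tfidf keys against that row's own tokens. The sketch below works on plain Python dicts and assumes that a correct result has exactly one tf-idf entry per distinct token in the row; the helper name `mismatched_rows` and the sample values are hypothetical, and exporting rows from ibis is left out.

```python
def mismatched_rows(rows):
    """Return the indices of rows whose tf-idf keys differ from their own tokens."""
    bad = []
    for i, row in enumerate(rows):
        # Assumption: a correct row has one tf-idf entry per distinct token.
        if set(row["tokens_tfidf"]) != set(row["tokens"]):
            bad.append(i)
    return bad


# Hypothetical rows shaped like the output above (values are made up).
rows = [
    {"tokens": ["USS", "AP"], "tokens_tfidf": {"USS": 2.3, "AP": 1.6}},
    {"tokens": ["USNV", "AA"], "tokens_tfidf": {"USS": 2.3, "AP": 1.6}},
]

bad_indices = mismatched_rows(rows)
print(bad_indices)  # prints [1]: the second row carries another row's tokens
```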