Incorrect join on large tables for add_tfidf #50

@jstammers

Description

I've found that the current implementation of add_tfidf does not correctly join the term frequencies back onto their rows for large tables.
Here's an example using faker that illustrates the problem:

```python
from mismo.sets import add_tfidf
import ibis
from ibis import _
from faker import Faker

ibis.options.interactive = True

f = Faker()
Faker.seed(1234)

addresses = [f.address() for _ in range(2_000_000)]

table = ibis.memtable({"address": addresses}).mutate(tokens=_.address.split(" "))
table = add_tfidf(table, "tokens")

table
```
```
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ address                     ┃ tokens                            ┃ tokens_tfidf                                                            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string                      │ array<string>                     │ map<string, float64>                                                    │
├─────────────────────────────┼───────────────────────────────────┼─────────────────────────────────────────────────────────────────────────┤
│ USS Hernandez\nFPO AP 03359 │ ['USS', 'Hernandez\nFPO', ... +2] │ {'USS': 2.357353036528199, 'AP': 1.6684075613664435, ... +2}            │
│ USNV Jenkins\nFPO AA 17631  │ ['USNV', 'Jenkins\nFPO', ... +2]  │ {'USS': 2.357353036528199, 'AP': 1.6684075613664435, ... +2}            │
│ USS Stanton\nFPO AP 43458   │ ['USS', 'Stanton\nFPO', ... +2]   │ {'AP': 1.6684075613664435, 'USNS': 2.3627657713296375, ... +2}          │
│ USNS Brown\nFPO AP 97018    │ ['USNS', 'Brown\nFPO', ... +2]    │ {'Olson\nFPO': 4.9226093222060765, 'USNV': 2.3593934563875294, ... +2}  │
│ USS Meza\nFPO AE 53363      │ ['USS', 'Meza\nFPO', ... +2]      │ {'13102': 5.521460917862246, 'Espinoza\nFPO': 5.21556014730925, ... +2} │
│ USNS Espinoza\nFPO AP 13102 │ ['USNS', 'Espinoza\nFPO', ... +2] │ {'USS': 2.357353036528199, 'Baker\nFPO': 4.511930402516782, ... +2}     │
│ USS Baker\nFPO AP 67296     │ ['USS', 'Baker\nFPO', ... +2]     │ {'USNV': 2.3593934563875294, '68208': 5.50607508852887, ... +2}         │
│ USNV Sanchez\nFPO AE 68208  │ ['USNV', 'Sanchez\nFPO', ... +2]  │ {'USNV': 2.3593934563875294, 'AA': 1.6625767137942553, ... +2}          │
│ USS Jackson\nFPO AA 20151   │ ['USS', 'Jackson\nFPO', ... +2]   │ {'USS': 2.357353036528199, '49182': 5.665301954088137, ... +2}          │
│ USS Harper\nFPO AP 49182    │ ['USS', 'Harper\nFPO', ... +2]    │ {'USNV': 2.3593934563875294, 'AA': 1.6625767137942553, ... +2}          │
│ …                           │ …                                 │ …                                                                       │
└─────────────────────────────┴───────────────────────────────────┴─────────────────────────────────────────────────────────────────────────┘
```

Notice that many rows (e.g. `USNV Jenkins\nFPO AA 17631`) end up with a `tokens_tfidf` map whose keys don't appear in that row's `tokens` at all.
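One way to quantify the mis-join (a hypothetical check written for illustration, not part of mismo) is to pull the result into pandas and compare each row's tokens against the keys of its tfidf map:

```python
import pandas as pd

def misjoined_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Return rows whose tokens_tfidf keys don't match the row's own tokens.

    A correct add_tfidf should produce a map keyed by exactly the distinct
    tokens of each row, so any mismatch indicates the term frequencies were
    joined onto the wrong row.
    """
    mask = [set(m) != set(t) for m, t in zip(df["tokens_tfidf"], df["tokens"])]
    return df[mask]
```

Running `misjoined_rows(table.execute())` should return an empty frame when the join is correct.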

I've been able to resolve this myself by caching the result of this line:

```python
with_counts = with_counts.mutate(__row_number=ibis.row_number())
```

My gut feeling is that it's related to the lazy evaluation of `ibis.row_number()`, which isn't preserved when joining `term_counts` with `idf`. This feels more like an upstream issue to me, as I would expect the row numbers to be consistent even if the intermediate table isn't cached.
