Pivot seems to not respect lazy evaluation #163

@alberto-i

Hello, is this the expected behavior?

I'm running the code below, which builds the same result twice: once with a composition of groupBy, select and inflate, and once with a single pivot call. Building the first takes 0.235 ms, while the pivot call takes 146.8 ms, roughly 625 times slower. A first call to "toArray" then takes 51.27 ms on the groupBy result and 34.456 ms on the pivot result, so pivot is about 1.5 times faster at that step.

The dataset is a 1.5 MB CSV file containing 27k rows.

const dataForge = require('data-forge');
require('data-forge-fs');

let start = process.hrtime();

const elapsed_time = function(note) {
    const precision = 3; // 3 decimal places
    const [seconds, nanoseconds] = process.hrtime(start); // read the timer once so both parts agree
    const elapsed = nanoseconds / 1000000; // divide by a million to convert nanoseconds to milliseconds
    console.log(seconds + " s, " + elapsed.toFixed(precision) + " ms - " + note); // print time + message
    start = process.hrtime(); // reset the timer
}

const df = dataForge
    .readFileSync('./data.csv')
    .parseCSV({ dynamicTyping: true })
    .withIndex((row) => `${row.meeting_id}_${row.item_id}_${row.user_id}_${row.source_id}`)

elapsed_time('parsecsv')

const sintetico = df
    .groupBy((row) => `${row.meeting_id}_${row.item_id}_${row.vote}`)
    .select((group) => ({
        meeting_id: group.first().meeting_id,
        item_id: group.first().item_id,
        vote: group.first().vote,
        stock: group.deflate(row => row.stock).sum(),
    }))
    .inflate()

elapsed_time('groupBy, select, inflate')

const sinteticoPivot = df.pivot(['meeting_id', 'item_id', 'vote'], {
    stock: dataForge.Series.sum
})

elapsed_time('pivot')

const data = sintetico.head(5).toArray()

elapsed_time('groupBy, select, inflate => toArray')

const data2 = sintetico.head(5).toArray()

elapsed_time('groupBy, select, inflate => toArray again')

const data3 = sinteticoPivot.head(5).toArray()

elapsed_time('pivot => toArray')

const data4 = sinteticoPivot.head(5).toArray()

elapsed_time('pivot => toArray again')

These are the outputs:

0 s, 183.236 ms - parsecsv
0 s, 0.235 ms - groupBy, select, inflate
0 s, 146.789 ms - pivot
0 s, 51.270 ms - groupBy, select, inflate => toArray
0 s, 1.200 ms - groupBy, select, inflate => toArray again
0 s, 34.456 ms - pivot => toArray
0 s, 13.261 ms - pivot => toArray again
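
One possible reading of these numbers is that the groupBy chain only records what to do and defers the work until "toArray", while pivot does at least part of its aggregation at call time. The sketch below illustrates that difference in plain JavaScript. It does not use data-forge, and `lazySum`, `eagerSum` and the `work` counter are hypothetical stand-ins, not the library's internals.

```javascript
// Hypothetical stand-ins to illustrate lazy vs eager evaluation.
// This is NOT how data-forge implements groupBy or pivot.

let work = 0; // counts how many rows have actually been processed

// Lazy: returns an object that only remembers the input.
// The aggregation runs when toArray() is called.
function lazySum(rows) {
    return {
        toArray() {
            let total = 0;
            for (const row of rows) {
                work++;
                total += row.stock;
            }
            return [{ stock: total }];
        },
    };
}

// Eager: aggregates immediately and wraps the finished result.
function eagerSum(rows) {
    let total = 0;
    for (const row of rows) {
        work++;
        total += row.stock;
    }
    const result = [{ stock: total }];
    return { toArray: () => result };
}

// Synthetic rows, same row count as the dataset in this issue.
const rows = Array.from({ length: 27000 }, (_, i) => ({ stock: i % 10 }));

const lazy = lazySum(rows);
const workAfterLazyBuild = work;  // 0: nothing processed yet

const eager = eagerSum(rows);
const workAfterEagerBuild = work; // 27000: every row processed up front

const lazyResult = lazy.toArray();   // the lazy work happens here
const eagerResult = eager.toArray(); // already computed, just returned

console.log({ workAfterLazyBuild, workAfterEagerBuild, lazyResult, eagerResult });
```

Under this reading, the 0.235 ms for the groupBy chain would be the cost of building the pipeline only, so a like-for-like comparison would add the first "toArray" time to each variant.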

Is this intended? Should I dig deeper to fix it and make a pull request?

Thanks,
