Labels
bug (Something isn't working), p1 (Important to tackle soon, but preemptable by p0)
Description
Describe the bug
When I run the following script, which deduplicates a list of strings, the output is `4 11`:
import daft
vals = [
"jjtzwafmzk",
"fjinpogsnd",
"advcnkwdgr",
"lkloaeeuvg",
"qdmljqvqxv",
"bknitbecis",
"fgqrpbilay",
"advcnkwdgr",
"shsjofbzml",
"zyjbwskjyk",
"utbqewwxoc",
"qdmljqvqxv",
"ezocoaxmsd",
"qdmljqvqxv",
]
daft.set_runner_ray()
df = daft.from_pydict({"val": vals})
ddf_vals = df.into_batches(4).distinct("val")
unique_vals = ddf_vals.count_rows()
print(unique_vals, len(set(vals)))
The value changes when I change the number of batches in `into_batches`, which is a bit surprising to me.
With `into_partitions` I get correct results. Does `count_rows` mean different things in these two contexts?
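For comparison, here is a minimal sketch of the `into_partitions` variant I mean, reusing the same `vals` and `df` as above (the exact call shape is my best guess at the equivalent pipeline, but it is where I see the expected count):
# Same pipeline as above, but repartitioning with into_partitions
# instead of into_batches; on my setup this returns the expected count.
unique_vals_parts = df.into_partitions(4).distinct("val").count_rows()
print(unique_vals_parts, len(set(vals)))  # prints: 11 11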
To Reproduce
I'm using daft version 0.7.2 and ray version 2.53.0.
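For completeness, a quick way to double-check the versions at runtime (this assumes both packages expose `__version__`):
import daft
import ray

# Print the installed versions (assumes both packages expose __version__).
print(daft.__version__, ray.__version__)  # 0.7.2 2.53.0 on my machine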
Expected behavior
The output should be `11 11`.
Component(s)
Distributed Runner (flotilla)
Additional context
No response