Commit 41c21bf

[SYSTEMDS-3822] Fix incorrect sampling in top-k cleaning pipelines
This patch fixes a bug in top-k cleaning pipeline enumeration: for datasets with more than 200K rows, the requested sampling ratio was ignored and always set to 0.6. As a result, users who asked for a smaller sample of a very large dataset actually ran on more data than expected.
1 parent b96cf25 commit 41c21bf

1 file changed, +3 -4 lines changed


scripts/pipelines/scripts/utils.dml

Lines changed: 3 additions & 4 deletions
@@ -62,9 +62,8 @@ doSample = function(Matrix[Double] eX, Matrix[Double] eY, Double ratio, Matrix[D
 MIN_SAMPLE = 1000
 sampledX = eX
 sampledY = eY
-ratio = ifelse(nrow(eY) > 200000, 0.6, ratio)
 sampled = floor(nrow(eX) * ratio)
-
+
 if(sampled > MIN_SAMPLE & ratio != 1.0)
 {
 sampleVec = sample(nrow(eX), sampled, FALSE, 23)
@@ -76,7 +75,7 @@ doSample = function(Matrix[Double] eX, Matrix[Double] eY, Double ratio, Matrix[D
 }
 else if(nrow(eY) == 1) { # for clustering
 sampledX = P %*% eX
-sampledY = eY
+sampledY = eY
 }
 print("sampled rows "+nrow(sampledY)+" out of "+nrow(eY))
 }
@@ -271,4 +270,4 @@ return(Frame[Unknown] data)
 # data[, idx] = map(data[, idx], "x -> UtilFunctions.getTimestamp(x)", margin=2)
 # }
 # }
-}
+}
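
To make the effect of the removed line concrete, here is a minimal sketch in plain Python (not DML) that models only the sample-size arithmetic visible in the first hunk. The function name `sample_size` and the `legacy` flag are hypothetical, introduced for illustration. The sketch also assumes that when the sampling condition fails the full input is kept, and it ignores the clustering branch and the actual row selection.

```python
import math

MIN_SAMPLE = 1000  # same constant as in doSample


def sample_size(n_rows, ratio, legacy=False):
    """Model how many rows would be kept for a given input size and ratio.

    legacy=True re-applies the override removed by this commit
    (ratio forced to 0.6 for inputs with more than 200000 rows).
    """
    if legacy and n_rows > 200000:
        ratio = 0.6  # the removed line: the caller's ratio is discarded
    sampled = math.floor(n_rows * ratio)
    # Same guard as in the diff: sample only if the result is large
    # enough and the ratio actually asks for a subset.
    if sampled > MIN_SAMPLE and ratio != 1.0:
        return sampled
    return n_rows  # assumption: otherwise the full input is kept


before = sample_size(1_000_000, 0.25, legacy=True)
after = sample_size(1_000_000, 0.25)
print(before, after)
```

Under these assumptions, a request for 25% of one million rows yields 600000 rows with the old override and 250000 rows after the fix.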
