Commit 5154585

russellb authored and shivchander committed
filterblock: Don't break when filtering on unexpected data
The previous code in this block did filtering assuming that all samples had a value that was correct for the type. For example, when filtering on an integer value, it assumed every row had a valid integer, where it may instead have garbage.

This change introduces a new helper, _convert_dtype(), which properly handles this condition. When the conversion fails on a `ValueError` exception, it treats it as `None` instead of allowing the exception to be raised up to the caller.

The fix was authored by Shiv in PR instructlab#72. I only pulled it out into a standalone commit.

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: shiv <[email protected]>
1 parent 5bff598 commit 5154585

1 file changed: +11 -4 lines changed

src/instructlab/sdg/filterblock.py

Lines changed: 11 additions & 4 deletions
@@ -20,13 +20,20 @@ def __init__(
         self.convert_dtype = convert_dtype
         self.num_procs = batch_kwargs.get("num_procs", 1)

+    def _convert_dtype(self, sample):
+        try:
+            sample[self.column_name] = self.convert_dtype(sample[self.column_name])
+        except ValueError as e:
+            logger.error(
+                "Error converting dtype: %s, filling with None to be filtered later", e
+            )
+            sample[self.column_name] = None
+        return sample
+
     def generate(self, samples) -> Dataset:
         if self.convert_dtype:
             samples = samples.map(
-                lambda x: {
-                    **x,
-                    self.column_name: self.convert_dtype(x[self.column_name]),
-                },
+                self._convert_dtype,
                 num_proc=self.num_procs,
             )

0 commit comments
