If there happens to be an abnormally long record in a CSV file---where
the rest are short---that abnormally long record ends up causing a
performance loss while parsing every subsequent record. This sort of
thing is usually caused by a buffer being expanded, with that expanded
buffer then incurring extra cost that shouldn't be paid when parsing
smaller records. Indeed, this case is no exception.
In this case, the standard record iterators use an internal record to
copy CSV data into, and then clone that record as appropriate in the
iterator's `next` method. In this way, the record's memory can be
reused. This is a bit better than allocating a fresh buffer every time,
since the length of each CSV row is usually pretty similar to the
length of prior rows.
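As a rough sketch of that pattern, using a simplified `Record` type
(these names are illustrative, not this crate's actual internals):

```rust
// Simplified model: the iterator owns one scratch record, the reader
// writes each row into it, and `next` hands back a clone.
#[derive(Clone, Default)]
struct Record {
    buf: Vec<u8>, // backing storage, grown to fit the longest row seen
    used: usize,  // bytes of `buf` occupied by the current row
}

struct RecordIter<'r> {
    rows: std::slice::Iter<'r, &'r [u8]>, // stand-in for the CSV reader
    scratch: Record,                      // reused across rows
}

impl<'r> Iterator for RecordIter<'r> {
    type Item = Record;

    fn next(&mut self) -> Option<Record> {
        let row = *self.rows.next()?;
        // Write the row into the scratch record, growing `buf` if needed.
        if self.scratch.buf.len() < row.len() {
            self.scratch.buf.resize(row.len(), 0);
        }
        self.scratch.buf[..row.len()].copy_from_slice(row);
        self.scratch.used = row.len();
        // Hand back a copy; `clone` duplicates *all* of `buf`, not just
        // the `used` prefix.
        Some(self.scratch.clone())
    }
}
```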
However, when we come across an exceptionally long record, the internal
record is expanded to handle it. When that internal record is cloned to
hand back to the caller, the record *and* its excess capacity are
cloned. After an abnormally long record, this ends up copying that
excess capacity for every subsequent row, which easily explains the
performance bug.
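To make the cost concrete, here is a small standalone illustration of
the same shape of problem, with made-up sizes: one huge row grows the
scratch buffer, and every later clone copies the full grown buffer.

```rust
fn main() {
    // One 1 MiB row followed by many 64-byte rows.
    let rows: Vec<Vec<u8>> = std::iter::once(vec![b'x'; 1 << 20])
        .chain((0..1_000).map(|_| vec![b'y'; 64]))
        .collect();

    let mut scratch: Vec<u8> = Vec::new();
    let mut copied = 0usize;
    for row in &rows {
        if scratch.len() < row.len() {
            scratch.resize(row.len(), 0); // grows once, to 1 MiB, and stays there
        }
        scratch[..row.len()].copy_from_slice(row);
        let cloned = scratch.clone(); // copies all of `scratch`, not just this row
        copied += cloned.len();
    }
    // Roughly 1 GiB copied for about 64 KiB of actual small-row data.
    println!("bytes copied by clones: {}", copied);
}
```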
So to fix it, we introduce a new private method that lets us copy a
record *without* excess capacity. (We could implement `Clone` more
intelligently, but I'm not sure whether it's appropriate to drop excess
capacity in a `Clone` impl. That might be unexpected.) We then use this
new method in the iterators instead of the standard `clone`.
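Sketched against the simplified `Record` model above (the helper's name
here is hypothetical, not the method actually added):

```rust
impl Record {
    // Copy only the bytes in use, so the clone's buffer is sized to the
    // current row rather than to the longest row seen so far.
    fn copy_used(&self) -> Record {
        Record {
            buf: self.buf[..self.used].to_vec(), // drops the excess
            used: self.used,
        }
    }
}
```

The iterator's `next` would then return `self.scratch.copy_used()`
instead of `self.scratch.clone()`, so a small row that follows a huge
one copies only its own length.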
In the case where there are no abnormally long records, this shouldn't
have any impact.
Fixes #227