
Commit 73cf38b

perf: fix record copying performance bug
If there happens to be an abnormally long record in a CSV file (where the rest are short), this abnormally long record ends up causing a performance loss while parsing subsequent records. Such a thing is usually caused by a buffer being expanded, and then that expanded buffer leading to extra cost that shouldn't be paid when parsing smaller records. Indeed, this case is no exception.

In this case, the standard record iterators use an internal record for copying CSV data into, and then clone this record as appropriate in the iterator's `next` method. In this way, that record's memory can be reused. This is a bit better than just allocating a fresh buffer every time, since generally speaking, the length of each CSV row is usually pretty similar to the length of prior rows.

However, when we come across an exceptionally long record, the internal record is expanded to handle that record. When that internal record is cloned to give back to the caller, the record *and* its excess capacity are also cloned. In the case of an abnormally long record, this ends up copying that extra excess capacity for all subsequent rows, which easily explains the performance bug.

So to fix it, we introduce a new private method that lets us copy a record *without* excess capacity. (We could implement `Clone` more intelligently, but I'm not sure whether it's appropriate to drop excess capacity in a `Clone` impl. That might be unexpected.) We then use this new method in the iterators instead of the standard `clone`. In the case where there are no abnormally long records, this shouldn't have any impact.

Fixes #227
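To make the failure mode concrete, here is a minimal, self-contained sketch of the buffer-reuse pattern the message describes. It is not code from this repository; `ReusedRecord`, `fields`, and `end` are stand-in names for the internal record, its byte buffer, and its used length.

struct ReusedRecord {
    // Raw bytes of the current row; the buffer is reused across rows.
    fields: Vec<u8>,
    // How much of `fields` the current row actually uses.
    end: usize,
}

impl ReusedRecord {
    // What the iterators effectively did before: copy the whole buffer,
    // including bytes past `end` left behind by an earlier, longer row.
    fn clone_full(&self) -> ReusedRecord {
        ReusedRecord { fields: self.fields.clone(), end: self.end }
    }

    // What the fix does conceptually: copy only the bytes this row uses.
    fn clone_truncated(&self) -> ReusedRecord {
        ReusedRecord { fields: self.fields[..self.end].to_vec(), end: self.end }
    }
}

fn main() {
    // One abnormally long row grows the reused buffer to 1 MiB...
    let mut rec = ReusedRecord { fields: vec![b'x'; 1 << 20], end: 1 << 20 };
    // ...and a later short row uses only the first 8 bytes of it.
    rec.end = 8;
    assert_eq!(rec.clone_full().fields.len(), 1 << 20); // still copies ~1 MiB
    assert_eq!(rec.clone_truncated().fields.len(), 8);  // copies 8 bytes
}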
1 parent 6623d87 commit 73cf38b

File tree

3 files changed: +24 / -4 lines


src/byte_record.rs

Lines changed: 12 additions & 0 deletions
@@ -497,6 +497,18 @@ impl ByteRecord {
         &self.0.fields[..self.0.bounds.end()]
     }
 
+    /// Clone this record, but only copy `fields` up to the end of bounds. This
+    /// is useful when one wants to copy a record, but not necessarily any
+    /// excess capacity in that record.
+    #[inline]
+    pub(crate) fn clone_truncated(&self) -> ByteRecord {
+        let mut br = ByteRecord::new();
+        br.0.pos = self.0.pos.clone();
+        br.0.bounds = self.0.bounds.clone();
+        br.0.fields = self.0.fields[..self.0.bounds.end()].to_vec();
+        br
+    }
+
     /// Retrieve the underlying parts of a byte record.
     #[inline]
     pub(crate) fn as_parts(&mut self) -> (&mut Vec<u8>, &mut Vec<usize>) {
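Since `clone_truncated` is `pub(crate)`, it can only be exercised from inside the crate. A minimal sketch of such a check (a hypothetical test, not part of this commit) would be:

#[test]
fn clone_truncated_matches_clone_on_fields() {
    let mut rec = ByteRecord::new();
    rec.push_field(b"foo");
    rec.push_field(b"quux");

    // Both copies must expose the same fields; the truncated copy just
    // drops any trailing bytes the internal buffer happens to hold.
    assert_eq!(rec.clone(), rec.clone_truncated());
    assert_eq!(rec.clone_truncated().get(1), Some(&b"quux"[..]));
}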

src/reader.rs

Lines changed: 4 additions & 4 deletions
@@ -2050,7 +2050,7 @@ impl<R: io::Read> Iterator for StringRecordsIntoIter<R> {
     fn next(&mut self) -> Option<Result<StringRecord>> {
         match self.rdr.read_record(&mut self.rec) {
             Err(err) => Some(Err(err)),
-            Ok(true) => Some(Ok(self.rec.clone())),
+            Ok(true) => Some(Ok(self.rec.clone_truncated())),
             Ok(false) => None,
         }
     }

@@ -2087,7 +2087,7 @@ impl<'r, R: io::Read> Iterator for StringRecordsIter<'r, R> {
     fn next(&mut self) -> Option<Result<StringRecord>> {
         match self.rdr.read_record(&mut self.rec) {
             Err(err) => Some(Err(err)),
-            Ok(true) => Some(Ok(self.rec.clone())),
+            Ok(true) => Some(Ok(self.rec.clone_truncated())),
             Ok(false) => None,
         }
     }

@@ -2126,7 +2126,7 @@ impl<R: io::Read> Iterator for ByteRecordsIntoIter<R> {
     fn next(&mut self) -> Option<Result<ByteRecord>> {
         match self.rdr.read_byte_record(&mut self.rec) {
             Err(err) => Some(Err(err)),
-            Ok(true) => Some(Ok(self.rec.clone())),
+            Ok(true) => Some(Ok(self.rec.clone_truncated())),
            Ok(false) => None,
         }
     }

@@ -2163,7 +2163,7 @@ impl<'r, R: io::Read> Iterator for ByteRecordsIter<'r, R> {
     fn next(&mut self) -> Option<Result<ByteRecord>> {
         match self.rdr.read_byte_record(&mut self.rec) {
             Err(err) => Some(Err(err)),
-            Ok(true) => Some(Ok(self.rec.clone())),
+            Ok(true) => Some(Ok(self.rec.clone_truncated())),
             Ok(false) => None,
         }
     }
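For reference, these `next` implementations sit behind the ordinary record iterators, so the change is transparent to callers. An illustrative use (assuming the `csv` crate as a dependency) looks like this; each yielded record is now copied without the excess buffer left by any earlier long row:

use std::error::Error;

fn count_fields(csv_data: &str) -> Result<usize, Box<dyn Error>> {
    let mut rdr = csv::Reader::from_reader(csv_data.as_bytes());
    let mut total = 0;
    // `records()` drives StringRecordsIter::next, which now hands back
    // records copied via `clone_truncated` instead of `clone`.
    for result in rdr.records() {
        total += result?.len();
    }
    Ok(total)
}

fn main() -> Result<(), Box<dyn Error>> {
    let data = "name,comment\nalice,short\nbob,short too\n";
    println!("{} fields", count_fields(data)?);
    Ok(())
}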

src/string_record.rs

Lines changed: 8 additions & 0 deletions
@@ -610,6 +610,14 @@ impl StringRecord {
         self.0
     }
 
+    /// Clone this record, but only copy `fields` up to the end of bounds. This
+    /// is useful when one wants to copy a record, but not necessarily any
+    /// excess capacity in that record.
+    #[inline]
+    pub(crate) fn clone_truncated(&self) -> StringRecord {
+        StringRecord(self.0.clone_truncated())
+    }
+
     /// A safe function for reading CSV data into a `StringRecord`.
     ///
     /// This relies on the internal representation of `StringRecord`.

0 commit comments
