|
1 | 1 | ---
2 | | -title: "A New CSV Library: 6x Faster, 40x Less Memory"
| 2 | +title: "A New CSV Library: Built for SQL Server"
3 | 3 | date: 2025-11-30 |
4 | 4 | author: "Chrissy LeMaire" |
5 | 5 | slug: "new-csv-library" |
@@ -31,22 +31,45 @@ What came back was fast as heck and used several patterns (apparently `Span<T>`, |
31 | 31 |
|
32 | 32 | ## The results |
33 | 33 |
|
34 | | -Using Claude to figure out benchmarking, I ran some proper benchmarks and the new Dataplat.Dbatools.Csv library isn't just a little faster. It's in a completely different performance class. |
| 34 | +With Claude's help on the benchmarking setup, I ran proper benchmarks comparing Dataplat.Dbatools.Csv not just against LumenWorks, but also against the modern CSV libraries: Sep, Sylvan, and CsvHelper.
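
If you want to reproduce the setup, the harness is plain BenchmarkDotNet. Here's a stripped-down sketch of its shape; it is not the exact code behind the tables below, and the class and file names are mine:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

// Skeleton of the harness shape, not the exact benchmark code.
[MemoryDiagnoser]                 // report allocations next to the timings
public class CsvParserBenchmarks
{
    private readonly string _path = "benchmark-100k-10col.csv";

    [GlobalSetup]
    public void Setup()
    {
        // Generate or copy the 100,000-row x 10-column test file to _path here,
        // outside the measured region.
    }

    [Benchmark(Baseline = true)]
    public long Dataplat()
    {
        // Open _path with Dataplat.Dbatools.Csv and read every row.
        // Sep, Sylvan, CsvHelper and LumenWorks each get their own
        // [Benchmark] method doing the same work.
        return 0;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<CsvParserBenchmarks>();
}
```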
35 | 35 |
|
36 | | -| Scenario | Dataplat | LumenWorks | Speed Boost | Memory Savings | |
37 | | -|----------|----------|------------|-------------|----------------| |
38 | | -| **Small** (1K rows) | 0.83 ms | 3.26 ms | **3.9x faster** | **25x less** | |
39 | | -| **Medium** (100K rows) | 65.3 ms | 364.5 ms | **5.6x faster** | **41x less** | |
40 | | -| **Large** (1M rows) | 559 ms | 3,435 ms | **6.1x faster** | **40x less** | |
41 | | -| **Wide** (100K×50 cols) | 277 ms | 493 ms | **1.8x faster** | **7.3x less** | |
| 36 | +**Benchmark: 100,000 rows × 10 columns (.NET 8, AVX-512)** |
42 | 37 |
|
43 | | -Processing 1 million rows (96 MB CSV file): |
44 | | -- **Dataplat**: 0.56 seconds using 420 MB RAM |
45 | | -- **LumenWorks**: 3.4 seconds using 16.7 GB RAM |
| 38 | +Here's the interesting thing: performance varies dramatically depending on how you access the data. |
46 | 39 |
|
47 | | -That's a **6.1x speed improvement** with **40x less memory allocation**. The memory difference is honestly the bigger deal here. LumenWorks creates so much garbage that large files can cause `OutOfMemoryException` on machines that should easily handle them and as a matter of fact, my benchmarking crashed my browser too. |
| 40 | +**Single column read (typical SqlBulkCopy/IDataReader pattern):** |
48 | 41 |
|
49 | | -6.1x was the max of all the benchmarks that I ran, though 4.7x was the average. |
| 42 | +| Library | Time (ms) | vs Dataplat | |
| 43 | +|---------|-----------|-------------| |
| 44 | +| Sep | 19 ms | 3.8x faster | |
| 45 | +| Sylvan | 29 ms | 2.5x faster | |
| 46 | +| **Dataplat** | **74 ms** | **baseline** | |
| 47 | +| CsvHelper | 76 ms | ~same | |
| 48 | +| LumenWorks | 433 ms | **5.9x slower** | |
| 49 | + |
| 50 | +**All columns read (full row processing):** |
| 51 | + |
| 52 | +| Library | Time (ms) | vs Dataplat | |
| 53 | +|---------|-----------|-------------| |
| 54 | +| Sep | 35 ms | 2.1x faster | |
| 55 | +| Sylvan | 37 ms | 2.0x faster | |
| 56 | +| **Dataplat** | **73 ms** | **baseline** | |
| 57 | +| CsvHelper | 101 ms | 1.4x slower | |
| 58 | +| LumenWorks | 100 ms | 1.4x slower | |
| 59 | + |
| 60 | +For the single-column pattern (which is how SqlBulkCopy typically reads data), Dataplat is **~6x faster** than LumenWorks! For full row processing, we're still **~1.4x faster**. |
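
To make "single column" versus "all columns" concrete, here's roughly what each pattern looks like against any `IDataReader` source. This is an illustration of what's being measured, not the literal benchmark code, and it assumes string columns for simplicity:

```csharp
using System.Data;

// The two access patterns from the tables above, written against plain IDataReader.
public static class AccessPatterns
{
    // "Single column read": touch only ordinal 0 on every row.
    public static long ReadSingleColumn(IDataReader reader)
    {
        long totalChars = 0;
        while (reader.Read())
        {
            totalChars += reader.GetString(0).Length;
        }
        return totalChars;
    }

    // "All columns read": materialize every field of every row.
    public static long ReadAllColumns(IDataReader reader)
    {
        long totalChars = 0;
        while (reader.Read())
        {
            for (int i = 0; i < reader.FieldCount; i++)
            {
                totalChars += reader.GetString(i).Length;
            }
        }
        return totalChars;
    }
}
```

Libraries like Sep and Sylvan gain the most in the single-column case, presumably because they defer field materialization until you actually ask for a column.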
| 61 | + |
| 62 | +### Where we stand in 2025 |
| 63 | + |
| 64 | +Let me be honest: if pure parsing speed is your only concern, [Sep](https://github.com/nietras/Sep/) is faster. Sep can hit 21 GB/s with AVX-512 SIMD. But our library isn't trying to be Sep. We're built for **database import workflows** where you need:
| 65 | + |
| 66 | +- **IDataReader interface** - Stream directly to SqlBulkCopy without intermediate allocations |
| 67 | +- **Built-in compression** - Import `.csv.gz` files without extracting first |
| 68 | +- **Real-world data handling** - Lenient parsing for messy enterprise exports |
| 69 | +- **Progress reporting** - Know how far along your 10 million row import is |
| 70 | +- **dbatools integration** - Works seamlessly with Import-DbaCsv |
| 71 | + |
| 72 | +If you're doing `file.csv.gz → SqlBulkCopy → SQL Server`, our complete workflow may actually be faster than combining Sep + manual decompression + manual IDataReader wrapping. |
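
For illustration, that end-to-end path looks something like this in C#. The `SqlBulkCopy` half is stock Microsoft.Data.SqlClient; the reader construction is a placeholder, so treat `CsvDataReader.Create` and the file and table names as assumptions and check the library docs for the exact factory method:

```csharp
using System.Data;
using Microsoft.Data.SqlClient;

// Sketch of csv.gz -> SqlBulkCopy -> SQL Server.
var options = new CsvReaderOptions();

using IDataReader reader = CsvDataReader.Create("exports/orders.csv.gz", options); // hypothetical factory
using var connection = new SqlConnection("Server=sql01;Database=tempdb;Integrated Security=true;TrustServerCertificate=true");
connection.Open();

using var bulkCopy = new SqlBulkCopy(connection)
{
    DestinationTableName = "dbo.Orders",
    BatchSize = 50_000,
    EnableStreaming = true          // stream rows instead of buffering the whole file
};

// WriteToServer(IDataReader) pulls rows straight off the CSV reader, so the gzip
// stream is decompressed and bulk loaded without an intermediate file on disk.
bulkCopy.WriteToServer(reader);
```

No temp file from decompression and no DataTable sitting in the middle: the rows go straight from the gzip stream into the bulk copy.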
50 | 73 |
|
51 | 74 | ### Why is it so much faster? |
52 | 75 |
|
@@ -124,6 +147,32 @@ German CSV with comma as decimal separator? French dates? We got you: |
124 | 147 | Import-DbaCsv -Path german_data.csv -SqlInstance sql01 -Database tempdb -Culture "de-DE" -AutoCreateTable |
125 | 148 | ``` |
126 | 149 |
|
| 150 | +### Progress reporting (v1.1.0) |
| 151 | + |
| 152 | +For those big imports where you want to know what's happening: |
| 153 | + |
| 154 | +```csharp |
| 155 | +var options = new CsvReaderOptions |
| 156 | +{ |
| 157 | + ProgressReportInterval = 10000, |
| 158 | + ProgressCallback = progress => |
| 159 | + { |
| 160 | + Console.WriteLine($"Processed {progress.RecordsRead:N0} records ({progress.RowsPerSecond:N0}/sec)"); |
| 161 | + } |
| 162 | +}; |
| 163 | +``` |
| 164 | + |
| 165 | +### Cancellation support (v1.1.0) |
| 166 | + |
| 167 | +Got a long-running import you need to stop? CancellationToken support is built in:
| 168 | + |
| 169 | +```csharp |
| 170 | +var options = new CsvReaderOptions |
| 171 | +{ |
| 172 | + CancellationToken = cancellationTokenSource.Token |
| 173 | +}; |
| 174 | +``` |
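
Putting the two option blocks together, a long import might be wired up like this. The reader factory (`CsvDataReader.Create`) is again a placeholder for whatever the library actually exposes, and I'm assuming cancellation surfaces as the usual `OperationCanceledException`:

```csharp
using var cts = new CancellationTokenSource(TimeSpan.FromMinutes(30)); // bail out of runaway imports

var options = new CsvReaderOptions
{
    ProgressReportInterval = 10000,
    ProgressCallback = p => Console.WriteLine($"{p.RecordsRead:N0} records so far"),
    CancellationToken = cts.Token
};

try
{
    using var reader = CsvDataReader.Create("huge.csv.gz", options); // hypothetical factory
    while (reader.Read())
    {
        // process each row...
    }
}
catch (OperationCanceledException)
{
    Console.WriteLine("Import stopped before completion.");
}
```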
| 175 | + |
127 | 176 | ## A brand new command: Export-DbaCsv |
128 | 177 |
|
129 | 178 | This one's been requested for years ([GitHub issue #8646](https://github.com/dataplat/dbatools/issues/8646)). We finally have a proper Export-DbaCsv with compression support: |
|