- Introduction
- Migrating from Ruby CSV
- Ruby CSV Pitfalls
- Parsing Strategy
- The Basic Read API
- The Basic Write API
- Batch Processing
- Configuration Options
- Row and Column Separators
- Header Transformations
- Header Validations
- Column Selection
- Data Transformations
- Value Converters
- Bad Row Quarantine
- Instrumentation Hooks
- Examples
- Real-World CSV Files
- SmarterCSV over the Years
- Release Notes
Let's explore the basic APIs for reading and writing CSV files. There is a simplified API (backward compatible with previous SmarterCSV versions) and the full API, which gives you access to the internal state of the reader or writer instance after processing.
SmarterCSV has convenient defaults for automatically detecting row and column separators based on the given data. This provides more robust parsing of input files when you have no control over the data, e.g. when users upload CSV files. See the Parsing Strategy section for details.
The simplified call to read CSV files is:
```
array_of_hashes = SmarterCSV.process(file_or_input, options)
```
To parse a CSV string directly (no file needed), use `SmarterCSV.parse`:
```
array_of_hashes = SmarterCSV.parse(csv_string, options)
```
This is equivalent to `SmarterCSV.process(StringIO.new(csv_string), options)` and is the idiomatic replacement for `CSV.parse(str, headers: true, header_converters: :symbol)`. See Migrating from Ruby CSV for a full comparison.
It can also be used with a block. The block always receives an array of hashes and an optional chunk index:
```
SmarterCSV.process(file_or_input, options) do |array_of_hashes|
# without chunk_size, each yield contains a one-element array (one row)
end
```
or
```
SmarterCSV.process(file_or_input, options) do |array_of_hashes, chunk_index|
# the chunk_index can be used to track chunks for parallel processing
end
```
When processing batches of rows, use the `chunk_size` option. The block receives an array of up to `chunk_size` hashes per yield:
```
SmarterCSV.process(file_or_input, {chunk_size: 100}) do |array_of_hashes, chunk_index|
# process one chunk of up to 100 rows of CSV data
puts "Processing chunk #{chunk_index}..."
end
```
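The batching semantics can be illustrated with plain Ruby (a sketch that does not require SmarterCSV; the `rows` array stands in for parsed row-hashes):

```
# Stand-in data: what SmarterCSV would hand you after parsing 250 rows.
rows = (1..250).map { |i| { id: i } }

# each_slice(100) mirrors chunk_size: 100 -- arrays of up to 100 hashes,
# plus a 0-based chunk index.
rows.each_slice(100).with_index do |chunk, chunk_index|
  puts "chunk #{chunk_index}: #{chunk.size} rows"
end
# chunk 0: 100 rows
# chunk 1: 100 rows
# chunk 2: 50 rows
```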
The simplified API works in most cases, but if you need access to the internal state and detailed results of the CSV parsing, use this form:
```
reader = SmarterCSV::Reader.new(file_or_input, options)
data = reader.process
puts reader.raw_headers
```
It can also be used with a block. The block always receives an array of hashes and an optional chunk index:
```
reader = SmarterCSV::Reader.new(file_or_input, options)
data = reader.process do |array_of_hashes, chunk_index|
# do something here
end
puts reader.raw_headers
```
This gives you access to the internal state of the reader instance after processing.
`Reader#each` is the modern, idiomatic way to read CSV rows one at a time. It always yields a single Hash per row and includes `Enumerable`, so every standard Ruby enumerable method works out of the box.
```
SmarterCSV.each('data.csv', options) do |hash|
  MyModel.upsert(hash)
end
```

Using a `Reader` instance keeps its state accessible after processing:

```
reader = SmarterCSV::Reader.new('data.csv', options)
reader.each do |hash|
  MyModel.upsert(hash)
end
puts reader.headers # accessible after processing
puts reader.errors.inspect
```

Called without a block, `each` returns an enumerator:

```
enum = SmarterCSV.each('data.csv', options)
enum.to_a # => [{ name: "Alice", ... }, { name: "Bob", ... }, ...]
```

Because `Reader` includes `Enumerable`, all standard Ruby enumerable methods work:

```
reader = SmarterCSV::Reader.new('data.csv', options)

# Filter rows
us_users = reader.select { |h| h[:country] == 'US' }

# Transform
names = reader.map { |h| h[:name] }

# Count good rows
reader.count

# Row index (0-based count of successfully parsed rows, excluding bad rows)
reader.each_with_index do |hash, i|
  puts "Row #{i}: #{hash[:name]}"
end

# Free chunking via Enumerable — no chunk_size needed
reader.each_slice(100) do |batch|
  MyModel.insert_all(batch)
end
```

`lazy` lets you stop early without reading the entire file:
```
# Read only the first 10 rows matching a condition
reader = SmarterCSV::Reader.new('big.csv', options)
result = reader.lazy.select { |h| h[:status] == 'active' }.first(10)
```

If `chunk_size` is set in options, `each` ignores it and always yields individual Hash objects. Use `each_chunk` for chunked batch processing.
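The early-exit behavior of `lazy` can be sketched with plain Ruby (no SmarterCSV required; the `Enumerator` stands in for a file reader and counts how many rows it actually produced):

```
rows_read = 0
source = Enumerator.new do |yielder|
  1_000.times do |i|
    rows_read += 1  # counts rows actually produced by the "reader"
    yielder << { id: i, status: i.even? ? 'active' : 'inactive' }
  end
end

# first(10) stops pulling from the source as soon as 10 matches are found
result = source.lazy.select { |h| h[:status] == 'active' }.first(10)

puts result.size  # 10
puts rows_read    # 19 -- far fewer than the 1_000 available rows
```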
`each` respects all `on_bad_row` options. Bad rows are skipped (or routed to your handler) and never yielded:

```
reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
reader.each { |hash| MyModel.upsert(hash) }
reader.errors[:bad_rows].each { |rec| puts "Bad row: #{rec[:error_message]}" }
```

After each row is parsed, SmarterCSV applies transformations to field values in this order:
| Step | Option | Default | Description |
|---|---|---|---|
| 1 | `strip_whitespace` | `true` | Strips leading/trailing whitespace from all values (and headers) at parse time |
| 2 | `nil_values_matching` | `nil` | Sets values matching the regexp to `nil` |
| 3 | `remove_empty_values` | `true` | Removes keys whose value is `nil` or blank |
| 4 | `remove_zero_values` | `false` | Removes keys whose value is numeric zero |
| 5 | `convert_values_to_numeric` | `true` | Converts numeric-looking strings to Integer or Float |
| 6 | `value_converters` | `nil` | Applies per-key custom converter lambdas or classes |
| 7 | `remove_empty_hashes` | `true` | Drops rows that are entirely empty after all transformations |
Steps 2–6 run per field, in that order, for every key/value pair in the row.
`value_converters` receive the value after numeric conversion — guard against `Integer`/`Float` input if needed.
See Data Transformations and Value Converters for details.
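As a rough illustration of that order, here is a plain-Ruby sketch (not SmarterCSV's actual implementation; the `NIL_MATCHER` pattern is a made-up example of `nil_values_matching`, step 4 is shown even though it defaults to `false`, and step 6 is omitted):

```
# Illustration only -- NOT SmarterCSV's implementation.
NIL_MATCHER = /\A(N\/A|NULL)\z/i  # hypothetical nil_values_matching pattern

def transform_row(row)
  out = row.each_with_object({}) do |(key, value), result|
    value = value.strip                          # 1. strip_whitespace
    value = nil if value&.match?(NIL_MATCHER)    # 2. nil_values_matching
    next if value.nil? || value.empty?           # 3. remove_empty_values
    next if value.match?(/\A-?0+(\.0+)?\z/)      # 4. remove_zero_values
    if value.match?(/\A-?\d+\z/)                 # 5. convert_values_to_numeric
      value = Integer(value)
    elsif value.match?(/\A-?\d+\.\d+\z/)
      value = Float(value)
    end
    result[key] = value                          # 6. value_converters would run here
  end
  out.empty? ? nil : out                         # 7. remove_empty_hashes
end

p transform_row(name: '  Alice ', age: '42', balance: '0.00', note: 'N/A')
```

Note how `balance: '0.00'` is dropped by step 4 before numeric conversion, while `note: 'N/A'` is set to `nil` by step 2 and then removed by step 3.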
Before any data rows are processed, the header line passes through these steps:
```
comment_regexp → strip_chars_from_headers → split on col_sep → strip quote_char
  → strip_whitespace → [gsub spaces/dashes→_ → downcase_header]
  → disambiguate_headers → symbolize → key_mapping
```
`user_provided_headers` bypasses the file header and all transformation steps — your array is used as-is.
See Header Transformations for the full step-by-step table and options.
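A plain-Ruby sketch of the default path through those steps (illustration only, not SmarterCSV's implementation; the comment-, disambiguation-, and key-mapping steps are omitted):

```
# Strip whitespace and quotes, replace spaces/dashes with underscores,
# downcase, then symbolize -- mirroring the default header cleanup.
def normalize_headers(header_line, col_sep: ',', quote_char: '"')
  header_line.split(col_sep).map do |header|
    header.strip
          .delete(quote_char)
          .gsub(/[\s-]+/, '_')
          .downcase
          .to_sym
  end
end

p normalize_headers('"First Name",Last-Name,eMail')
# => [:first_name, :last_name, :email]
```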
While SmarterCSV uses sensible defaults to process the most common CSV files, it will raise exceptions if it cannot auto-detect `col_sep` or `row_sep`, or if it encounters other problems. Therefore, please rescue from `SmarterCSV::Error` and handle outliers according to your requirements.
If you encounter unusual CSV files, please follow the tips in the Troubleshooting section below. You can use the options below to accommodate unusual formats.
If your CSV file is not being parsed correctly, examine it in a text editor. For closer inspection, a tool like `hexdump` can help find otherwise hidden control characters or byte sequences like BOMs.
```
$ hexdump -C spec/fixtures/bom_test_feff.csv
00000000 fe ff 73 6f 6d 65 5f 69 64 2c 74 79 70 65 2c 66 |..some_id,type,f|
00000010 75 7a 7a 62 6f 78 65 73 0d 0a 34 32 37 36 36 38 |uzzboxes..427668|
00000020 30 35 2c 7a 69 7a 7a 6c 65 73 2c 31 32 33 34 0d |05,zizzles,1234.|
00000030 0a 33 38 37 35 39 31 35 30 2c 71 75 69 7a 7a 65 |.38759150,quizze|
00000040 73 2c 35 36 37 38 0d 0a |s,5678..|
```
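The `fe ff` at the start of that dump is the UTF-16BE byte-order mark. A stdlib-only sketch of checking for a BOM programmatically (no SmarterCSV required; the byte signatures are the standard Unicode BOMs):

```
# Map leading byte signatures to their encodings.
BOM_SIGNATURES = {
  "\xEF\xBB\xBF".b => 'UTF-8',
  "\xFE\xFF".b     => 'UTF-16BE',
  "\xFF\xFE".b     => 'UTF-16LE',
}.freeze

# Returns the encoding name if the file starts with a known BOM, else nil.
def detect_bom(path)
  head = File.binread(path, 3) || ''.b
  match = BOM_SIGNATURES.find { |bom, _name| head.start_with?(bom) }
  match && match.last
end
```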
- By default, quote escaping uses `:auto` mode — SmarterCSV tries backslash-escape (`\"`) first and falls back to RFC 4180 doubled quotes (`""`). Use `quote_escaping: :double_quotes` or `:backslash` to fix the mode explicitly. See Parsing Strategy.
- Quote characters around fields are expected to be balanced, e.g. valid: `"field"`, invalid: `"field\"` — an escaped `quote_char` does not denote the end of a field.
- If you have a CSV file which contains unicode characters, you can process it as follows:

```
File.open(filename, "r:bom|utf-8") do |f|
  data = SmarterCSV.process(f)
end
```

- If the CSV file with unicode characters is in a remote location, you similarly need to give the encoding as an option to the `open` call:
```
require 'open-uri'

file_location = 'http://your.remote.org/sample.csv'
URI.open(file_location, 'r:utf-8') do |f| # don't forget to specify the UTF-8 encoding!!
  data = SmarterCSV.process(f)
end
```

PREVIOUS: Parsing Strategy | NEXT: The Basic Write API | UP: README