
Commit 3765670

committed
Add CSV as input format for convert, count, schema, head, tail with --has-headers option
Made-with: Cursor
1 parent 32c2cc5 commit 3765670

23 files changed: +571 −115 lines changed

README.md

Lines changed: 43 additions & 15 deletions
@@ -26,8 +26,8 @@ cargo install --git https://github.com/aisrael/datu
 | Parquet (`.parquet`, `.parq`) ||||
 | Avro (`.avro`) ||||
 | ORC (`.orc`) ||||
+| CSV (`.csv`) ||||
 | XLSX (`.xlsx`) ||||
-| CSV (`.csv`) ||||
 | JSON (`.json`) ||||
 | JSON (pretty) ||||
 | YAML ||||

@@ -36,6 +36,8 @@ cargo install --git https://github.com/aisrael/datu
 - **Write** — Output file formats for `convert`.
 - **Display** — Output format when printing to stdout (`schema`, `head`, `tail` via `--output`: csv, json, json-pretty, yaml).
 
+**CSV options:** When reading CSV files, the `--has-headers` option controls whether the first row is treated as column names. Omitting the option or passing bare `--has-headers` means true (header present); pass `--has-headers=false` for headerless CSV. Applies to `convert`, `count`, `schema`, `head`, and `tail`.
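The `--has-headers` semantics described in this added paragraph can be sketched in a few lines. This is an illustration only (datu itself is Rust, and the positional `column_0`-style names for headerless input are an assumption, not necessarily datu's actual naming):

```python
import csv
import io

def read_csv(text: str, has_headers: bool = True):
    """Sketch of what a --has-headers flag controls when reading CSV.

    Hypothetical illustration: with headers, the first row supplies column
    names; without, positional names are synthesized.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if has_headers:
        return rows[0], rows[1:]
    # No header row: synthesize positional column names (assumed scheme).
    names = [f"column_{i}" for i in range(len(rows[0]))] if rows else []
    return names, rows

data = "one,two\nfoo,1\nbar,2\n"
cols, records = read_csv(data)                       # header present (default)
cols2, records2 = read_csv(data, has_headers=False)  # first row is data
```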
+
 Usage
 =====

@@ -60,9 +62,9 @@ Perform the same conversion and column filtering.
 
 ### `schema`
 
-Display the schema of a Parquet, Avro, or ORC file (column names, types, and nullability). Useful for inspecting file structure without reading data.
+Display the schema of a Parquet, Avro, CSV, or ORC file (column names, types, and nullability). Useful for inspecting file structure without reading data. CSV schema uses type inference from the data.
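The type inference mentioned here could work roughly as below. This is an illustrative guess at the approach, not datu's actual inference rules (real inference typically also handles nulls, booleans, and dates):

```python
def infer_type(values):
    """Pick the narrowest type that fits every non-empty value in a column.

    Simplified int -> float -> string ladder, as a sketch of CSV schema
    inference.
    """
    def fits(v, cast):
        try:
            cast(v)
            return True
        except ValueError:
            return False

    non_empty = [v for v in values if v != ""]
    if non_empty and all(fits(v, int) for v in non_empty):
        return "int64"
    if non_empty and all(fits(v, float) for v in non_empty):
        return "float64"
    return "string"
```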
-**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), ORC (`.orc`).
+**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), CSV (`.csv`), ORC (`.orc`).
 
 **Usage:**

@@ -75,6 +77,7 @@ datu schema <FILE> [OPTIONS]
 | Option | Description |
 |--------|-------------|
 | `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`. |
+| `--has-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--has-headers=false` for headerless CSV. |
 
 **Output formats:**
@@ -104,25 +107,35 @@ datu schema events.avro -o YAML
 
 ### `count`
 
-Return the number of rows in a Parquet, Avro, or ORC file.
+Return the number of rows in a Parquet, Avro, CSV, or ORC file.
 
-**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), ORC (`.orc`).
+**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), CSV (`.csv`), ORC (`.orc`).
 
 **Usage:**
 
 ```sh
-datu count <FILE>
+datu count <FILE> [OPTIONS]
 ```
 
+**Options:**
+
+| Option | Description |
+|--------|-------------|
+| `--has-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--has-headers=false` for headerless CSV. |
+
 **Examples:**
 
 ```sh
 # Count rows in a Parquet file
 datu count data.parquet
 
-# Count rows in an Avro or ORC file
+# Count rows in an Avro, CSV, or ORC file
 datu count events.avro
+datu count data.csv
 datu count data.orc
+
+# Count rows in a headerless CSV file
+datu count data.csv --has-headers=false
 ```
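One subtlety behind a CSV `count`: quoted fields may contain embedded newlines, so counting lines over-counts. A sketch of counting parsed records instead (illustrative only, not datu's implementation):

```python
import csv
import io

def count_csv_rows(text: str, has_headers: bool = True) -> int:
    """Count records the way a CSV parser sees them.

    Iterates parsed records rather than raw lines, so multi-line quoted
    fields count as one record; subtracts the header row when present.
    """
    n = sum(1 for _ in csv.reader(io.StringIO(text)))
    return max(n - 1, 0) if has_headers else n
```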
 
 ---
@@ -131,7 +144,7 @@ datu count data.orc
 
 Convert data between supported formats. Input and output formats are inferred from file extensions.
 
-**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), ORC (`.orc`).
+**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), CSV (`.csv`), ORC (`.orc`).
 
 **Supported output formats:** CSV (`.csv`), JSON (`.json`), Parquet (`.parquet`, `.parq`), Avro (`.avro`), ORC (`.orc`), XLSX (`.xlsx`).

@@ -149,23 +162,30 @@ datu convert <INPUT> <OUTPUT> [OPTIONS]
 | `--limit <N>` | Maximum number of records to read from the input. |
 | `--sparse` | For JSON/YAML: omit keys with null/missing values. Default: true. Use `--sparse=false` to include default values (e.g. empty string). |
 | `--json-pretty` | When converting to JSON, format output with indentation and newlines. Ignored for other output formats. |
+| `--has-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--has-headers=false` for headerless CSV. |
 
 **Examples:**
 
 ```sh
 # Parquet to CSV (all columns)
 datu convert data.parquet data.csv
 
+# CSV to Parquet (with automatic type inference)
+datu convert data.csv data.parquet
+
 # Parquet to Avro (first 1000 rows)
 datu convert data.parquet data.avro --limit 1000
 
 # Avro to CSV, only specific columns
 datu convert events.avro events.csv --select id,timestamp,user_id
 
+# CSV to JSON with headerless input
+datu convert data.csv output.json --has-headers=false
+
 # Parquet to Parquet with column subset
 datu convert input.parq output.parquet --select one,two,three
 
-# Parquet, Avro, or ORC to Excel (.xlsx)
+# Parquet, Avro, CSV, or ORC to Excel (.xlsx)
 datu convert data.parquet report.xlsx
 
 # Parquet or Avro to ORC
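The CSV-to-JSON direction added by this commit can be approximated with the standard library. A sketch under the same assumptions as above (the `column_0`-style names for headerless input are a hypothetical convention, not datu's documented output):

```python
import csv
import io
import json

def csv_to_json(text: str, has_headers: bool = True) -> str:
    """Convert CSV text to a JSON array of objects (illustrative sketch).

    All values stay strings here; a real converter would apply the
    inferred column types first.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return "[]"
    if has_headers:
        names, data = rows[0], rows[1:]
    else:
        names, data = [f"column_{i}" for i in range(len(rows[0]))], rows
    return json.dumps([dict(zip(names, r)) for r in data])
```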
@@ -179,9 +199,9 @@ datu convert data.parquet data.json
 
 ### `head`
 
-Print the first N rows of a Parquet, Avro, or ORC file to stdout (default CSV; use `--output` for other formats).
+Print the first N rows of a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use `--output` for other formats).
 
-**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), ORC (`.orc`).
+**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), CSV (`.csv`), ORC (`.orc`).
 
 **Usage:**

@@ -197,6 +217,7 @@ datu head <INPUT> [OPTIONS]
 | `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`. |
 | `--sparse` | For JSON/YAML: omit keys with null/missing values. Default: true. Use `--sparse=false` to include default values. |
 | `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
+| `--has-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--has-headers=false` for headerless CSV. |
 
 **Examples:**

@@ -207,21 +228,25 @@ datu head data.parquet
 # First 100 rows
 datu head data.parquet -n 100
 datu head data.avro --number 100
+datu head data.csv -n 100
 datu head data.orc --number 100
 
 # First 20 rows, specific columns
 datu head data.parquet -n 20 --select id,name,email
+
+# Head from a headerless CSV file
+datu head data.csv --has-headers=false
 ```
 
 ---
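For streaming formats like CSV, `head` only ever needs the first N records, so it can stop reading early. A minimal sketch of that idea, assuming a header row (illustration, not datu's code):

```python
import csv
import io
from itertools import islice

def head_csv(text: str, n: int):
    """First n data rows of a CSV, streaming.

    islice stops consuming the reader after n records, so the rest of
    the input is never parsed.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    return header, list(islice(reader, n))
```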
 
 ### `tail`
 
-Print the last N rows of a Parquet, Avro, or ORC file to stdout (default CSV; use `--output` for other formats).
+Print the last N rows of a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use `--output` for other formats).
 
-**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), ORC (`.orc`).
+**Supported input formats:** Parquet (`.parquet`, `.parq`), Avro (`.avro`), CSV (`.csv`), ORC (`.orc`).
 
-> **Note:** For Avro files, `tail` requires a full file scan since Avro does not support random access to the end of the file.
+> **Note:** For Avro and CSV files, `tail` requires a full file scan since these formats do not support random access to the end of the file.
 
 **Usage:**

@@ -237,6 +262,7 @@ datu tail <INPUT> [OPTIONS]
 | `--output <FORMAT>` | Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`. |
 | `--sparse` | For JSON/YAML: omit keys with null/missing values. Default: true. Use `--sparse=false` to include default values. |
 | `--select <COLUMNS>...` | Columns to include. If not specified, all columns are printed. Same format as `convert --select`. |
+| `--has-headers [BOOL]` | For CSV input: whether the first row is a header. Default: true when omitted. Use `--has-headers=false` for headerless CSV. |
 
 **Examples:**

@@ -247,6 +273,7 @@ datu tail data.parquet
 # Last 50 rows
 datu tail data.parquet -n 50
 datu tail data.avro --number 50
+datu tail data.csv -n 50
 datu tail data.orc --number 50
 
 # Last 20 rows, specific columns
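The full-scan note above does not imply full-file memory: a single forward pass can keep only the most recent N records. A sketch of one common way to do this (a bounded deque; an illustration, not datu's actual implementation):

```python
import csv
import io
from collections import deque

def tail_csv(text: str, n: int):
    """Last n data rows of a CSV via one forward scan in O(n) memory.

    deque(maxlen=n) discards older records as new ones arrive, leaving
    exactly the final n when the scan completes.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    return header, list(deque(reader, maxlen=n))
```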
@@ -285,10 +312,11 @@ read("input") |> ... |> write("output")
 
 #### `read(path)`
 
-Read a data file. Supported formats: Parquet (`.parquet`, `.parq`), Avro (`.avro`), ORC (`.orc`).
+Read a data file. Supported formats: Parquet (`.parquet`, `.parq`), Avro (`.avro`), CSV (`.csv`), ORC (`.orc`). CSV files are assumed to have a header row by default.
 
 ```text
 > read("data.parquet") |> write("data.csv")
+> read("data.csv") |> write("data.parquet")
 ```
 
 #### `write(path)`

features/cli/convert.feature

Lines changed: 21 additions & 0 deletions
@@ -41,6 +41,27 @@ Feature: Convert
     And the first line of that file should contain "one,two"
     And that file should have 4 lines
 
+  Scenario: CSV to Parquet
+    When I run `datu convert fixtures/table.csv $TEMPDIR/table_from_csv.parquet`
+    Then the command should succeed
+    And the output should contain "Converting fixtures/table.csv to $TEMPDIR/table_from_csv.parquet"
+    And the file "$TEMPDIR/table_from_csv.parquet" should exist
+
+  Scenario: CSV to JSON
+    When I run `datu convert fixtures/table.csv $TEMPDIR/table_from_csv.json`
+    Then the command should succeed
+    And the output should contain "Converting fixtures/table.csv to $TEMPDIR/table_from_csv.json"
+    And the file "$TEMPDIR/table_from_csv.json" should exist
+    And the file "$TEMPDIR/table_from_csv.json" should be valid JSON
+    And the file "$TEMPDIR/table_from_csv.json" should contain "one"
+    And the file "$TEMPDIR/table_from_csv.json" should contain "two"
+
+  Scenario: CSV to Parquet with --has-headers=false
+    When I run `datu convert fixtures/no_header.csv $TEMPDIR/no_header.parquet --has-headers=false`
+    Then the command should succeed
+    And the output should contain "Converting fixtures/no_header.csv to $TEMPDIR/no_header.parquet"
+    And the file "$TEMPDIR/no_header.parquet" should exist
+
   Scenario: Avro to CSV
     When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.csv`
     Then the command should succeed

features/cli/count.feature

Lines changed: 6 additions & 1 deletion
@@ -1,5 +1,5 @@
 Feature: Count
-  Return the number of rows in a Parquet, Avro, or ORC file.
+  Return the number of rows in a Parquet, Avro, CSV, or ORC file.
 
   Scenario: Count Parquet
     When I run `datu count fixtures/table.parquet`

@@ -17,3 +17,8 @@ Feature: Count
     When I run `datu count $TEMPDIR/userdata5.orc`
     Then the command should succeed
     And the output should contain "10"
+
+  Scenario: Count CSV
+    When I run `datu count fixtures/table.csv`
+    Then the command should succeed
+    And the output should contain "3"

features/cli/head.feature

Lines changed: 20 additions & 1 deletion
@@ -1,5 +1,5 @@
 Feature: Head
-  Print the first N rows of a Parquet, Avro, or ORC file as CSV.
+  Print the first N rows of a Parquet, Avro, CSV, or ORC file as CSV.
 
   Scenario: Head Parquet default (10 lines)
     When I run `datu head fixtures/userdata.parquet`

@@ -35,6 +35,25 @@ Feature: Head
     And the output should have a header and 2 lines
     And the first line of the output should be: id,email
 
+  Scenario: Head CSV default (10 lines)
+    When I run `datu head fixtures/table.csv`
+    Then the command should succeed
+    And the output should have a header and 3 lines
+    And the first line of the output should contain "one"
+    And the first line of the output should contain "two"
+
+  Scenario: Head CSV with -n 2
+    When I run `datu head fixtures/table.csv -n 2`
+    Then the command should succeed
+    And the output should have a header and 2 lines
+    And the first line of the output should contain "one,two"
+
+  Scenario: Head CSV with --select
+    When I run `datu head fixtures/table.csv -n 2 --select two,four`
+    Then the command should succeed
+    And the output should have a header and 2 lines
+    And the first line of the output should be: two,four
+
   Scenario: Head ORC default (10 lines)
     When I run `datu convert fixtures/userdata5.avro $TEMPDIR/userdata5.orc --select id,first_name --limit 10`
     Then the command should succeed

features/cli/schema.feature

Lines changed: 7 additions & 1 deletion
@@ -1,5 +1,5 @@
 Feature: Schema
-  Display the schema of a Parquet, Avro, or ORC file.
+  Display the schema of a Parquet, Avro, CSV, or ORC file.
 
   Scenario: Schema Parquet default (csv output)
     When I run `datu schema fixtures/table.parquet`

@@ -64,3 +64,9 @@ Feature: Schema
     Then the command should succeed
     And the output should contain "id"
     And the output should contain "first_name"
+
+  Scenario: Schema CSV default (csv output)
+    When I run `datu schema fixtures/table.csv`
+    Then the command should succeed
+    And the output should contain "one"
+    And the output should contain "two"

features/cli/tail.feature

Lines changed: 13 additions & 1 deletion
@@ -1,5 +1,5 @@
 Feature: Tail
-  Print the last N rows of a Parquet, Avro, or ORC file as CSV.
+  Print the last N rows of a Parquet, Avro, CSV, or ORC file as CSV.
 
   Scenario: Tail Parquet default (10 lines)
     When I run `datu tail fixtures/table.parquet`

@@ -12,6 +12,18 @@ Feature: Tail
     And the first line of the output should contain "one"
     And the first line of the output should contain "two"
 
+  Scenario: Tail CSV default
+    When I run `datu tail fixtures/table.csv`
+    Then the command should succeed
+    And the first line of the output should contain "one"
+    And the first line of the output should contain "two"
+
+  Scenario: Tail CSV with -n 2
+    When I run `datu tail fixtures/table.csv -n 2`
+    Then the command should succeed
+    And the first line of the output should contain "one"
+    And the output should contain "baz"
+
   Scenario: Tail Avro default (10 lines)
     When I run `datu tail fixtures/userdata5.avro`
     Then the command should succeed

features/repl/conversion.feature

Lines changed: 28 additions & 0 deletions
@@ -97,6 +97,34 @@ Feature: Conversion
     Then the file "$TEMPDIR/userdata5.xlsx" should exist
     And that file should be valid XLSX
 
+  Scenario: CSV to Parquet
+    When the REPL is ran and the user types:
+      ```
+      read("fixtures/table.csv") |> write("$TEMPDIR/table_from_csv.parquet")
+      ```
+    Then the file "$TEMPDIR/table_from_csv.parquet" should exist
+    And that file should be valid Parquet
+
+  Scenario: CSV to JSON
+    When the REPL is ran and the user types:
+      ```
+      read("fixtures/table.csv") |> write("$TEMPDIR/table_from_csv.json")
+      ```
+    Then the file "$TEMPDIR/table_from_csv.json" should exist
+    And that file should be valid JSON
+    And that file should contain "one"
+    And that file should contain "two"
+
+  Scenario: CSV to CSV with select
+    When the REPL is ran and the user types:
+      ```
+      read("fixtures/table.csv") |> select(:two, :four) |> write("$TEMPDIR/table_csv_select.csv")
+      ```
+    Then the file "$TEMPDIR/table_csv_select.csv" should exist
+    And that file should be a CSV file
+    And the first line of that file should be: "two,four"
+    And that file should have 4 lines
+
   Scenario: ORC to CSV
     When the REPL is ran and the user types:
       ```
features/repl/head.feature

Lines changed: 12 additions & 0 deletions
@@ -42,6 +42,18 @@ Feature: Head
     And the first line of that file should contain "first_name"
     And that file should have 6 lines
 
+  Scenario: Head from CSV
+    When the REPL is ran and the user types:
+      ```
+      read("fixtures/table.csv") |> head(3) |> write("$TEMPDIR/head_csv.csv")
+      ```
+    Then the file "$TEMPDIR/head_csv.csv" should exist
+    And that file should be a CSV file
+    And the first line of that file should be: "one,two,three,four,five,__index_level_0__"
+    And that file should have 4 lines
+    And that file should contain "foo"
+    And that file should contain "bar"
+
   Scenario: Head from ORC
     When the REPL is ran and the user types:
       ```

features/repl/tail.feature

Lines changed: 12 additions & 0 deletions
@@ -42,6 +42,18 @@ Feature: Tail
     And the first line of that file should contain "first_name"
     And that file should have 6 lines
 
+  Scenario: Tail from CSV
+    When the REPL is ran and the user types:
+      ```
+      read("fixtures/table.csv") |> tail(2) |> write("$TEMPDIR/tail_csv.csv")
+      ```
+    Then the file "$TEMPDIR/tail_csv.csv" should exist
+    And that file should be a CSV file
+    And the first line of that file should be: "one,two,three,four,five,__index_level_0__"
+    And that file should have 3 lines
+    And that file should contain "bar"
+    And that file should contain "baz"
+
   Scenario: Tail from ORC
     When the REPL is ran and the user types:
      ```
