datu - a data file utility

Datu (Filipino) - a traditional chief or local leader

datu is intended to be a lightweight, fast, and versatile CLI tool for reading, querying, and converting data in various file formats, such as Parquet, Avro, ORC, CSV, JSON, YAML, and .XLSX.

Installation

Prerequisites: Rust ~> 1.95 (or recent stable)

cargo install datu

To install from source:

cargo install --git https://github.com/aisrael/datu

Supported Formats

Format	Read	Write	Display
Parquet (`.parquet`, `.parq`)	✓	✓	—
Avro (`.avro`)	✓	✓	—
ORC (`.orc`)	✓	✓	—
CSV (`.csv`)	✓	✓	✓
XLSX (`.xlsx`)	—	✓	—
JSON (`.json`)	—	✓	✓
JSON (pretty)	—	—	✓
YAML	—	—	✓

Read — Input file formats for convert, count, schema, head, and tail.
Write — Output file formats for convert.
Display — Output format when printing to stdout (schema, head, tail via --output: csv, json, json-pretty, yaml).

CSV options: When reading CSV files, the --has-headers option controls whether the first row is treated as column names. Omitted or --has-headers means true (header present); --has-headers=false for headerless CSV. Applies to convert, count, schema, head, and tail.

Usage

datu can be used non-interactively as a typical command-line utility, or it can be ran without specifying a command in interactive mode, providing a REPL-like interface.

For example, the command

datu convert table.parquet --select id,email table.csv

And, interactively, using the REPL

datu
> read("table.parquet") |> select(:id, :email) |> write("table.csv")

Perform the same conversion and column filtering.

Commands

`schema`

Display the schema of a Parquet, Avro, CSV, or ORC file (column names, types, and nullability). Useful for inspecting file structure without reading data. CSV schema uses type inference from the data.

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).

Usage:

datu schema <FILE> [OPTIONS]

Options:

Option	Description
`--output <FORMAT>`	Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`.
`--has-headers [BOOL]`	For CSV input: whether the first row is a header. Default: `true` when omitted. Use `--has-headers=false` for headerless CSV.

Output formats:

csv (default): One line per column, e.g. name: String (UTF8), nullable.
json: JSON array of objects with name, data_type, nullable, and optionally converted_type (Parquet).
json-pretty: Same as json but pretty-printed for readability.
yaml: YAML list of mappings with the same fields.

Examples:

# Default CSV-style output
datu schema data.parquet

# JSON output
datu schema data.parquet --output json

# JSON pretty-printed
datu schema data.parquet --output json-pretty

# YAML output (e.g. for config or tooling)
datu schema events.avro --output yaml
datu schema events.avro -o YAML

`count`

Return the number of rows in a Parquet, Avro, CSV, or ORC file.

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).

Usage:

datu count <FILE> [OPTIONS]

Options:

Option	Description
`--has-headers [BOOL]`	For CSV input: whether the first row is a header. Default: `true` when omitted. Use `--has-headers=false` for headerless CSV.

Examples:

# Count rows in a Parquet file
datu count data.parquet

# Count rows in an Avro, CSV, or ORC file
datu count events.avro
datu count data.csv
datu count data.orc

# Count rows in a headerless CSV file
datu count data.csv --has-headers=false

`convert`

Convert data between supported formats. Input and output formats are inferred from file extensions.

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).

Supported output formats: CSV (.csv), JSON (.json), Parquet (.parquet, .parq), Avro (.avro), ORC (.orc), XLSX (.xlsx).

Usage:

datu convert <INPUT> <OUTPUT> [OPTIONS]

Options:

Option	Description
`--select <COLUMNS>...`	Columns to include. If not specified, all columns are written. Column names can be given as multiple arguments or as comma-separated values (e.g. `--select id,name,email` or `--select id --select name --select email`).
`--limit <N>`	Maximum number of records to read from the input.
`--sparse`	For JSON/YAML: omit keys with null/missing values. Default: `true`. Use `--sparse=false` to include default values (e.g. empty string).
`--json-pretty`	When converting to JSON, format output with indentation and newlines. Ignored for other output formats.
`--has-headers [BOOL]`	For CSV input: whether the first row is a header. Default: `true` when omitted. Use `--has-headers=false` for headerless CSV.

Examples:

# Parquet to CSV (all columns)
datu convert data.parquet data.csv

# CSV to Parquet (with automatic type inference)
datu convert data.csv data.parquet

# Parquet to Avro (first 1000 rows)
datu convert data.parquet data.avro --limit 1000

# Avro to CSV, only specific columns
datu convert events.avro events.csv --select id,timestamp,user_id

# CSV to JSON with headerless input
datu convert data.csv output.json --has-headers=false

# Parquet to Parquet with column subset
datu convert input.parq output.parquet --select one,two,three

# Parquet, Avro, CSV, or ORC to Excel (.xlsx)
datu convert data.parquet report.xlsx

# Parquet or Avro to ORC
datu convert data.parquet data.orc

# Parquet or Avro to JSON
datu convert data.parquet data.json

`head`

Print the first N rows of a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use --output for other formats).

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).

Usage:

datu head <INPUT> [OPTIONS]

Options:

Option	Description
`-n`, `--number <N>`	Number of rows to print. Default: 10.
`--output <FORMAT>`	Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`.
`--sparse`	For JSON/YAML: omit keys with null/missing values. Default: `true`. Use `--sparse=false` to include default values.
`--select <COLUMNS>...`	Columns to include. If not specified, all columns are printed. Same format as `convert --select`.
`--has-headers [BOOL]`	For CSV input: whether the first row is a header. Default: `true` when omitted. Use `--has-headers=false` for headerless CSV.

Examples:

# First 10 rows (default)
datu head data.parquet

# First 100 rows
datu head data.parquet -n 100
datu head data.avro --number 100
datu head data.csv -n 100
datu head data.orc --number 100

# First 20 rows, specific columns
datu head data.parquet -n 20 --select id,name,email

# Head from a headerless CSV file
datu head data.csv --has-headers=false

`tail`

Print the last N rows of a Parquet, Avro, CSV, or ORC file to stdout (default CSV; use --output for other formats).

Supported input formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc).

Note: For Avro and CSV files, tail requires a full file scan since these formats do not support random access to the end of the file.

Usage:

datu tail <INPUT> [OPTIONS]

Options:

Option	Description
`-n`, `--number <N>`	Number of rows to print. Default: 10.
`--output <FORMAT>`	Output format: `csv`, `json`, `json-pretty`, or `yaml`. Case insensitive. Default: `csv`.
`--sparse`	For JSON/YAML: omit keys with null/missing values. Default: `true`. Use `--sparse=false` to include default values.
`--select <COLUMNS>...`	Columns to include. If not specified, all columns are printed. Same format as `convert --select`.
`--has-headers [BOOL]`	For CSV input: whether the first row is a header. Default: `true` when omitted. Use `--has-headers=false` for headerless CSV.

Examples:

# Last 10 rows (default)
datu tail data.parquet

# Last 50 rows
datu tail data.parquet -n 50
datu tail data.avro --number 50
datu tail data.csv -n 50
datu tail data.orc --number 50

# Last 20 rows, specific columns
datu tail data.parquet -n 20 --select id,name,email

# Redirect tail output to a file
datu tail data.parquet -n 1000 > last1000.csv

Version

Print the installed datu version:

datu version

Interactive Mode (REPL)

Running datu without any command starts an interactive REPL (Read-Eval-Print Loop):

datu
>

In the REPL, you compose data pipelines using the |> (pipe) operator to chain functions together. The general pattern is:

read("input") |> ... |> write("output")

Functions

`read(path)`

Read a data file. Supported formats: Parquet (.parquet, .parq), Avro (.avro), CSV (.csv), ORC (.orc). CSV files are assumed to have a header row by default.

> read("data.parquet") |> write("data.csv")
> read("data.csv") |> write("data.parquet")

`write(path)`

Write data to a file. The output format is inferred from the file extension. Supported formats: CSV (.csv), JSON (.json), YAML (.yaml), Parquet (.parquet, .parq), Avro (.avro), ORC (.orc), XLSX (.xlsx).

> read("data.parquet") |> write("output.json")

`select(columns...)`

Select and reorder columns. Columns can be specified using symbol syntax (:name) or string syntax ("name").

> read("data.parquet") |> select(:id, :email) |> write("subset.csv")
> read("data.parquet") |> select("id", "email") |> write("subset.csv")

Columns appear in the output in the order they are listed, so select can also be used to reorder columns:

> read("data.parquet") |> select(:email, :id) |> write("reordered.csv")

`head(n)`

Take the first n rows.

> read("data.parquet") |> head(10) |> write("first10.csv")

`tail(n)`

Take the last n rows.

> read("data.parquet") |> tail(10) |> write("last10.csv")

Composing Pipelines

Functions can be chained in any order to build more complex pipelines:

> read("users.avro") |> select(:id, :first_name, :email) |> head(5) |> write("top5.json")
> read("data.parquet") |> select(:two, :one) |> tail(1) |> write("last_row.csv")

How it Works Internally

Internally, datu constructs a pipeline based on the command and arguments.

For example, the following invocation

datu convert input.parquet output.csv --select id,name,email

constructs a pipeline that's composed of:

a parquet reader step that reads the input.parquet file then chains to
a "select column" step that filters for only the id, name, and email columns, then finally
a CSV writer step, that writes the id, name, and email columns from input.parquet to output.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

datu - a data file utility

Installation

Supported Formats

Usage

Commands

`schema`

`count`

`convert`

`head`

`tail`

Version

Interactive Mode (REPL)

Functions

`read(path)`

`write(path)`

`select(columns...)`

`head(n)`

`tail(n)`

Composing Pipelines

How it Works Internally

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

datu - a data file utility

Installation

Supported Formats

Usage

Commands

schema

count

convert

head

tail

Version

Interactive Mode (REPL)

Functions

read(path)

write(path)

select(columns...)

head(n)

tail(n)

Composing Pipelines

How it Works Internally

`schema`

`count`

`convert`

`head`

`tail`

`read(path)`

`write(path)`

`select(columns...)`

`head(n)`

`tail(n)`