
parquet-ruby

Read and write Apache Parquet files from Ruby. This gem wraps the official Apache `parquet` Rust crate, providing:

  • High-performance columnar data storage and retrieval
  • Memory-efficient streaming APIs for large datasets
  • Full compatibility with the Apache Parquet specification
  • Simple, Ruby-native APIs that feel natural

Why Use This Library?

Apache Parquet is the de facto standard for analytical data storage, offering:

  • Efficient compression - typically 2-10x smaller than CSV
  • Fast columnar access - read only the columns you need
  • Rich type system - preserves data types, including nested structures
  • Wide ecosystem support - works with Spark, Pandas, DuckDB, and more

Installation

Add this line to your application's Gemfile:

gem 'parquet'

Then execute:

$ bundle install

Or install it directly:

$ gem install parquet

Quick Start

Reading Data

require "parquet"

# Read Parquet files row by row
Parquet.each_row("data.parquet") do |row|
  puts row  # => {"id" => 1, "name" => "Alice", "score" => 95.5}
end

# Or column by column for better performance
Parquet.each_column("data.parquet", batch_size: 1000) do |batch|
  puts batch  # => {"id" => [1, 2, ...], "name" => ["Alice", "Bob", ...]}
end

Writing Data

# Define your schema
schema = [
  { "id" => "int64" },
  { "name" => "string" },
  { "score" => "double" }
]

# Write row by row
rows = [
  [1, "Alice", 95.5],
  [2, "Bob", 82.3]
]

Parquet.write_rows(rows.each, schema: schema, write_to: "output.parquet")

Reading Parquet Files

The library provides two APIs for reading data, each optimized for different use cases:

Row-wise Reading (Sequential Access)

Best for: Processing records one at a time, data transformations, ETL pipelines

# Basic usage - returns hashes
Parquet.each_row("data.parquet") do |row|
  puts row  # => {"id" => 1, "name" => "Alice"}
end

# Memory-efficient array format
Parquet.each_row("data.parquet", result_type: :array) do |row|
  puts row  # => [1, "Alice"]
end

# Read specific columns only
Parquet.each_row("data.parquet", columns: ["id", "name"]) do |row|
  # Only requested columns are loaded from disk
end

# Works with IO objects
File.open("data.parquet", "rb") do |file|
  Parquet.each_row(file) do |row|
    # Process row
  end
end

Column-wise Reading (Analytical Access)

Best for: Analytics, aggregations, and cases where you need only a few columns from wide tables

# Process data in column batches
Parquet.each_column("data.parquet", batch_size: 1000) do |batch|
  # batch is a hash of column_name => array_of_values
  ids = batch["id"]      # => [1, 2, 3, ..., 1000]
  names = batch["name"]  # => ["Alice", "Bob", ...]

  # Perform columnar operations
  avg_id = ids.sum.to_f / ids.length
end

# Array format for more control
Parquet.each_column("data.parquet",
                    result_type: :array,
                    columns: ["id", "name"]) do |batch|
  # batch is an array of arrays
  # [[1, 2, ...], ["Alice", "Bob", ...]]
end

File Metadata

Inspect file structure without reading data:

metadata = Parquet.metadata("data.parquet")

puts metadata["num_rows"]           # Total row count
puts metadata["created_by"]         # Writer identification
puts metadata["schema"]["fields"]   # Column definitions
puts metadata["row_groups"].size    # Number of row groups

Writing Parquet Files

Row-wise Writing

Best for: Streaming data, converting from other formats, memory-constrained environments

# Basic schema definition
schema = [
  { "id" => "int64" },
  { "name" => "string" },
  { "active" => "boolean" },
  { "balance" => "double" }
]

require "csv"

# Stream rows lazily from the CSV so the whole file never sits in memory
rows = CSV.foreach("input.csv").lazy.map do |row|
  [row[0].to_i, row[1], row[2] == "true", row[3].to_f]
end

Parquet.write_rows(rows,
  schema: schema,
  write_to: "output.parquet",
  batch_size: 5000  # Rows per batch (default: 1000)
)

Column-wise Writing

Best for: Data that is already in columnar form, better compression, higher throughput

# Prepare columnar data
ids = [1, 2, 3, 4, 5]
names = ["Alice", "Bob", "Charlie", "Diana", "Eve"]
scores = [95.5, 82.3, 88.7, 91.2, 79.8]

# Create batches
batches = [[
  ids,    # First column
  names,  # Second column
  scores  # Third column
]]

schema = [
  { "id" => "int64" },
  { "name" => "string" },
  { "score" => "double" }
]

Parquet.write_columns(batches.each,
  schema: schema,
  write_to: "output.parquet",
  compression: "snappy"  # Options: none, snappy, gzip, lz4, zstd
)

Data Types

Basic Types

schema = [
  # Integers
  { "tiny" => "int8" },         # -128 to 127
  { "small" => "int16" },       # -32,768 to 32,767
  { "medium" => "int32" },      # ±2 billion
  { "large" => "int64" },       # ±9 quintillion

  # Unsigned integers
  { "ubyte" => "uint8" },       # 0 to 255
  { "ushort" => "uint16" },     # 0 to 65,535
  { "uint" => "uint32" },       # 0 to 4 billion
  { "ulong" => "uint64" },      # 0 to 18 quintillion

  # Floating point
  { "price" => "float" },       # 32-bit precision
  { "amount" => "double" },     # 64-bit precision

  # Other basics
  { "name" => "string" },
  { "data" => "binary" },
  { "active" => "boolean" }
]

Date and Time Types

schema = [
  # Date (days since Unix epoch)
  { "date" => "date32" },

  # Timestamps (with different precisions)
  { "created_sec" => "timestamp_second" },
  { "created_ms" => "timestamp_millis" },    # Most common
  { "created_us" => "timestamp_micros" },
  { "created_ns" => "timestamp_nanos" },

  # Time of day (without date)
  { "time_ms" => "time_millis" },    # Milliseconds since midnight
  { "time_us" => "time_micros" }     # Microseconds since midnight
]
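
A hedged sketch of writing into these column types, assuming the writer accepts Ruby Date values for date32 and Time values for timestamp columns (check the gem's documentation for the exact Ruby types it expects):

require "parquet"
require "date"

# Illustrative schema: one date column and one millisecond-precision timestamp
schema = [
  { "event_date" => "date32" },
  { "created_ms" => "timestamp_millis" }
]

rows = [
  [Date.new(2024, 1, 15), Time.utc(2024, 1, 15, 12, 30, 0)]  # assumed accepted types
]

Parquet.write_rows(rows.each, schema: schema, write_to: "dates.parquet")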

Decimal Type (Financial Data)

For exact decimal arithmetic (no floating-point errors):

require "bigdecimal"

schema = [
  # Financial amounts with 2 decimal places
  { "price" => "decimal", "precision" => 10, "scale" => 2 },  # Up to 99,999,999.99
  { "balance" => "decimal", "precision" => 15, "scale" => 2 }, # Larger amounts

  # High-precision calculations
  { "rate" => "decimal", "precision" => 10, "scale" => 8 }     # 8 decimal places
]

# Use BigDecimal for exact values
data = [[
  BigDecimal("19.99"),
  BigDecimal("1234567.89"),
  BigDecimal("0.00000123")
]]
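
For completeness, a minimal sketch writing and reading those values with the schema above (the output file name is illustrative):

Parquet.write_rows(data.each, schema: schema, write_to: "amounts.parquet")

Parquet.each_row("amounts.parquet") do |row|
  # Exact decimal values are preserved; verify the returned Ruby type for your version
  puts row["price"]
end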

Complex Data Structures

The library includes a powerful Schema DSL for defining nested data:

Using the Schema DSL

schema = Parquet::Schema.define do
  # Simple fields
  field :id, :int64, nullable: false      # Required field
  field :name, :string                    # Optional by default

  # Nested structure
  field :address, :struct do
    field :street, :string
    field :city, :string
    field :location, :struct do
      field :lat, :double
      field :lng, :double
    end
  end

  # Lists
  field :tags, :list, item: :string
  field :scores, :list, item: :int32

  # Maps (dictionaries)
  field :metadata, :map, key: :string, value: :string

  # Complex combinations
  field :contacts, :list, item: :struct do
    field :name, :string
    field :email, :string
    field :primary, :boolean
  end
end

Writing Complex Data

data = [[
  1,                              # id
  "Alice Johnson",                # name
  {                               # address
    "street" => "123 Main St",
    "city" => "Springfield",
    "location" => {
      "lat" => 40.7128,
      "lng" => -74.0060
    }
  },
  ["ruby", "parquet", "data"],    # tags
  [85, 92, 88],                   # scores
  { "dept" => "Engineering" },    # metadata
  [                               # contacts
    { "name" => "Bob", "email" => "[email protected]", "primary" => true },
    { "name" => "Carol", "email" => "[email protected]", "primary" => false }
  ]
]]

Parquet.write_rows(data.each, schema: schema, write_to: "complex.parquet")

⚠️ Important Limitations

Timezone Handling in Parquet

The Parquet specification has a fundamental limitation with timezone storage:

  1. UTC-normalized: Any timestamp with timezone info (including "+09:00" or "America/New_York") is converted to UTC
  2. Local/unzoned: Timestamps without timezone info are stored as-is

The original timezone information is permanently lost. This is not a limitation of this library but of the Parquet format itself.

schema = Parquet::Schema.define do
  # These BOTH store in UTC - timezone info is lost!
  field :timestamp_utc, :timestamp_millis, timezone: "UTC"
  field :timestamp_tokyo, :timestamp_millis, timezone: "+09:00"

  # This stores as local time (no timezone)
  field :timestamp_local, :timestamp_millis
end

# If you need timezone preservation, store it separately:
schema = Parquet::Schema.define do
  field :timestamp, :timestamp_millis, has_timezone: true  # UTC storage
  field :original_tz, :string                              # "America/New_York"
end
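
A hedged sketch of that pattern, assuming the writer accepts Ruby Time values for timestamp columns and the reader returns them as UTC Time objects (verify both for your version):

require "parquet"

schema = Parquet::Schema.define do
  field :timestamp, :timestamp_millis, has_timezone: true  # stored as UTC
  field :original_tz, :string                              # e.g. "+09:00"
end

rows = [[Time.now.utc, "+09:00"]]
Parquet.write_rows(rows.each, schema: schema, write_to: "events.parquet")

Parquet.each_row("events.parquet") do |row|
  # Reapply the stored offset to recover the original local time
  local_time = row["timestamp"].getlocal(row["original_tz"])
  puts local_time
end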

Performance Tips

  1. Use column-wise reading when you need only a few columns from wide tables
  2. Specify columns parameter to avoid reading unnecessary data
  3. Choose appropriate batch sizes:
    • Larger batches = better throughput but more memory
    • Smaller batches = less memory but more overhead
  4. Pre-sort data by commonly filtered columns for better compression
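
To illustrate the first three tips, a minimal sketch of a column-wise aggregation that touches only two columns (file name, column names, and batch size are placeholders to tune for your data):

# Read only two columns from a wide table, in moderately sized batches
totals = Hash.new(0.0)

Parquet.each_column("wide_table.parquet",
                    columns: ["region", "revenue"],
                    batch_size: 10_000) do |batch|
  batch["region"].zip(batch["revenue"]) do |region, revenue|
    totals[region] += revenue
  end
end

puts totals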

Memory Management

Control memory usage with flush thresholds:

Parquet.write_rows(huge_dataset.each,
  schema: schema,
  write_to: "output.parquet",
  batch_size: 1000,              # Rows before considering flush
  flush_threshold: 32 * 1024**2  # Flush if batch exceeds 32MB
)

Architecture

This gem uses a modular architecture:

  • parquet-core: Language-agnostic Rust core for Parquet operations
  • parquet-ruby-adapter: Ruby-specific FFI adapter layer
  • parquet gem: High-level Ruby API

See ARCH.md for a detailed overview of the architecture.

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/njaremko/parquet-ruby.

License

The gem is available as open source under the terms of the MIT License.
