Commit 463b0dc

Create Red Datasets Parquet skeleton (#3)
Thank you for implementing [Red Datasets](https://github.com/red-data-tools/red-datasets)!
I'm sorry that this PR is too big to review easily, but I didn't change any code logic.

## Related Issue

- #2

## What I did

- Moved all code from [Red Datasets](https://github.com/red-data-tools/red-datasets) to here
- Removed the APIs and tests from the Red Datasets code that are not needed here
- Renamed the namespace and files from Red Datasets to Red Datasets Parquet
- Cleared the release notes, version, and README.md
- Updated the gemspec, copyright dates, and runtime dependencies

## What I didn't

- Change code logic
- Write the content in README.md

## What I checked

- Passed `bundle exec rake`

```
red-datasets-parquet % bundle exec rake
Loaded suite test
Started
...........
Finished in 26.732554 seconds.
--------------------------------------------------------------------------------------------------
11 tests, 11 assertions, 0 failures, 0 errors, 0 pendings, 0 omissions, 0 notifications
100% passed
--------------------------------------------------------------------------------------------------
0.41 tests/s, 0.41 assertions/s
```

<details>
<summary>working notes</summary>

## What I will do

- [x] Move all code from [Red Datasets](https://github.com/red-data-tools/red-datasets) to here
- [x] Remove unnecessary code in the Red Datasets Parquet gem
- [x] Edit document contents from Red Datasets to Red Datasets Parquet
  - Change the namespace from `Datasets` to `DatasetsParquet`, etc.
- [x] Check code
- [x] Write PR description

</details>
1 parent 2d6dce5 commit 463b0dc

15 files changed: +446 -0 lines changed

.github/workflows/test.yml

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+name: Test
+
+on:
+  push:
+  pull_request:
+  schedule:
+    - cron: |
+        0 0 * * 0
+
+jobs:
+  test:
+    name: "Ruby ${{ matrix.ruby-version }}: ${{ matrix.runs-on }}"
+    strategy:
+      # Run one job at a time to avoid downloading the Parquet datasets too often in a short time.
+      max-parallel: 1
+      fail-fast: false
+      matrix:
+        ruby-version:
+          - "2.7"
+          - "3.0"
+          - "3.1"
+        runs-on:
+          - macos-latest
+          - ubuntu-latest
+          - windows-latest
+    runs-on: ${{ matrix.runs-on }}
+    env:
+      # We can invalidate the current cache by updating this.
+      CACHE_VERSION: "2022-08-27"
+    steps:
+      - uses: actions/checkout@v3
+      - uses: ruby/setup-ruby@v1
+        with:
+          ruby-version: ${{ matrix.ruby-version }}
+      - uses: actions/cache@v3
+        if: |
+          runner.os == 'Linux'
+        with:
+          path: |
+            ~/.cache/red-datasets
+          key: ${{ env.CACHE_VERSION }}-${{ runner.os }}-${{ hashFiles('lib/**') }}
+          restore-keys: |
+            ${{ env.CACHE_VERSION }}-${{ runner.os }}-
+      - uses: actions/cache@v3
+        if: |
+          runner.os == 'macOS'
+        with:
+          path: |
+            ~/Library/Caches/red-datasets
+          key: ${{ env.CACHE_VERSION }}-${{ runner.os }}-${{ hashFiles('lib/**') }}
+          restore-keys: |
+            ${{ env.CACHE_VERSION }}-${{ runner.os }}-
+      - uses: actions/cache@v3
+        if: |
+          runner.os == 'Windows'
+        with:
+          path: |
+            ~/AppData/Local/red-datasets
+          key: ${{ env.CACHE_VERSION }}-${{ runner.os }}-${{ hashFiles('lib/**') }}
+          restore-keys: |
+            ${{ env.CACHE_VERSION }}-${{ runner.os }}-
+      - name: Prepare the Apache Arrow APT repository
+        if: |
+          runner.os == 'Linux'
+        run: |
+          sudo apt update
+          sudo apt install -y -V ca-certificates lsb-release wget
+          wget https://apache.jfrog.io/artifactory/arrow/$(lsb_release --id --short | tr 'A-Z' 'a-z')/apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+          sudo apt install -y -V ./apache-arrow-apt-source-latest-$(lsb_release --codename --short).deb
+          sudo apt update
+      - name: Install dependencies
+        run: |
+          bundle install
+      - name: Test
+        run: |
+          bundle exec rake

.yardopts

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+--output-dir doc/reference/en
+--markup markdown
+--markup-provider kramdown
+lib/**/*.rb
+-
+doc/text/**/*

Gemfile

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
+# -*- ruby -*-
+
+source "https://rubygems.org/"
+
+gemspec

LICENSE.txt

Lines changed: 8 additions & 0 deletions
@@ -0,0 +1,8 @@
+Copyright 2022 Kouhei Sutou <[email protected]>
+Copyright 2022 otegami <[email protected]>
+
+Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

README.md

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
+# Red Datasets Parquet
+
+## Description
+
+## Install
+
+## Available datasets
+
+## Usage
+
+## How to develop Red Datasets Parquet
+
+1. Fork <https://github.com/red-data-tools/red-datasets-parquet>
+2. Create a feature branch from main
+3. Develop in the feature branch
+4. Pull request from the feature branch to <https://github.com/red-data-tools/red-datasets-parquet>
+
+## License
+
+The MIT license. See `LICENSE.txt` for details.

Rakefile

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
+# -*- ruby -*-
+
+require "rubygems"
+require "bundler/gem_helper"
+
+base_dir = File.join(File.dirname(__FILE__))
+
+helper = Bundler::GemHelper.new(base_dir)
+def helper.version_tag
+  version
+end
+
+helper.install
+spec = helper.gemspec
+
+desc "Run tests"
+task :test do
+  ruby("test/run-test.rb")
+end
+
+task default: :test
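
The `def helper.version_tag` in the Rakefile above defines a singleton method, so only that one helper object gets the changed tag format. The following is a minimal sketch of the technique, not Bundler's actual code: `FakeGemHelper` is a hypothetical stand-in for `Bundler::GemHelper`, the `"v"`-prefixed default is an assumption about the helper's usual behavior, and the version string is made up.

```ruby
# Hypothetical stand-in for Bundler::GemHelper, reduced to two methods.
class FakeGemHelper
  # Made-up version string, for illustration only.
  def version
    "0.0.1"
  end

  # Assumed default: the release tag is the version with a "v" prefix.
  def version_tag
    "v#{version}"
  end
end

helper = FakeGemHelper.new
other = FakeGemHelper.new

# Singleton method: overrides version_tag on this one object only.
def helper.version_tag
  version
end

p helper.version_tag # prints "0.0.1"
p other.version_tag  # prints "v0.0.1"
```

Under these assumptions, the override makes release tags match the bare version number instead of a prefixed one, without subclassing the helper.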

doc/text/news.md

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+# News

example/tlc-yellow-taxi-trip.rb

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
+#!/usr/bin/env ruby
+
+require "datasets-parquet"
+
+trips = DatasetsParquet::TLC::YellowTaxiTrip.new(year: 2022, month: 1)
+trips.each do |trip|
+  p [
+    trip.vendor,
+    trip.tpep_pickup_datetime,
+    trip.tpep_dropoff_datetime,
+    trip.passenger_count,
+    trip.trip_distance,
+    trip.rate_code,
+    trip.store_and_fwd?,
+    trip.pu_location_id,
+    trip.do_location_id,
+    trip.payment,
+    trip.fare_amount,
+    trip.extra,
+    trip.mta_tax,
+    trip.tip_amount,
+    trip.tolls_amount,
+    trip.improvement_surcharge,
+    trip.total_amount,
+    trip.congestion_surcharge,
+    trip.airport_fee
+  ]
+  # [:creative_mobile_technologies, 2022-01-01 09:35:40 +0900, 2022-01-01 09:53:29 +0900, 2.0, 3.8, :standard_rate, false, 142, 236, :credit_card, 14.5, 3.0, 0.5, 3.65, 0.0, 0.3, 21.95, 2.5, 0.0]
+  # [:creative_mobile_technologies, 2022-01-01 09:33:43 +0900, 2022-01-01 09:42:07 +0900, 1.0, 2.1, :standard_rate, false, 236, 42, :credit_card, 8.0, 0.5, 0.5, 4.0, 0.0, 0.3, 13.3, 0.0, 0.0]
+end
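
The example above always passes a block, but `YellowTaxiTrip#each` in this commit returns an Enumerator when called without one (`return to_enum(__method__) unless block_given?`). A minimal sketch of what that contract allows, using a hypothetical `FakeTrips` class with made-up distances so that nothing is downloaded:

```ruby
# Hypothetical stand-in that follows the same block-or-enumerator
# contract as YellowTaxiTrip#each, backed by an in-memory array.
class FakeTrips
  def initialize(distances)
    @distances = distances
  end

  def each
    # Without a block, hand back an Enumerator over this same method.
    return to_enum(__method__) unless block_given?

    @distances.each do |distance|
      yield(distance)
    end
  end
end

trips = FakeTrips.new([3.8, 2.1, 0.97])

# Take only the first two records instead of printing all of them.
first_two = trips.each.first(2)

# Enumerable methods are available on the returned Enumerator.
long_trips = trips.each.select { |distance| distance > 2.0 }

p first_two  # prints [3.8, 2.1]
p long_trips # prints [3.8, 2.1]
```

With the real dataset the same calls would presumably still load the whole Parquet file first, since `open_data` reads the table before iterating; the Enumerator only limits how many records are yielded.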

lib/datasets-parquet.rb

Lines changed: 6 additions & 0 deletions
@@ -0,0 +1,6 @@
+require "datasets"
+require "parquet"
+
+require_relative "datasets-parquet/version"
+
+require_relative "datasets-parquet/tlc/yellow-taxi-trip"

lib/datasets-parquet/tlc/yellow-taxi-trip.rb

Lines changed: 109 additions & 0 deletions
@@ -0,0 +1,109 @@
+module DatasetsParquet
+  module TLC
+    class YellowTaxiTrip < Datasets::Dataset
+      class Record < Struct.new(:vendor,
+                                :tpep_pickup_datetime,
+                                :tpep_dropoff_datetime,
+                                :passenger_count,
+                                :trip_distance,
+                                :rate_code,
+                                :store_and_fwd,
+                                :pu_location_id,
+                                :do_location_id,
+                                :payment,
+                                :fare_amount,
+                                :extra,
+                                :mta_tax,
+                                :tip_amount,
+                                :tolls_amount,
+                                :improvement_surcharge,
+                                :total_amount,
+                                :congestion_surcharge,
+                                :airport_fee)
+        alias_method :store_and_fwd?, :store_and_fwd
+
+        def initialize(*values)
+          super()
+          members.zip(values) do |member, value|
+            __send__("#{member}=", value)
+          end
+        end
+
+        def vendor=(vendor)
+          super(vendor == 1 ? :creative_mobile_technologies : :veri_fone_inc)
+        end
+
+        def rate_code=(rate_code)
+          case rate_code
+          when 1.0
+            super(:standard_rate)
+          when 2.0
+            super(:jfk)
+          when 3.0
+            super(:newark)
+          when 4.0
+            super(:nassau_or_westchester)
+          when 5.0
+            super(:negotiated_fare)
+          when 6.0
+            super(:group_ride)
+          end
+        end
+
+        def store_and_fwd=(store_and_fwd)
+          super(store_and_fwd == 'Y')
+        end
+
+        def payment=(payment)
+          case payment
+          when 1
+            super(:credit_card)
+          when 2
+            super(:cash)
+          when 3
+            super(:no_charge)
+          when 4
+            super(:dispute)
+          when 5
+            super(:unknown)
+          when 6
+            super(:voided_trip)
+          end
+        end
+      end
+
+      def initialize(year: Date.today.year, month: Date.today.month)
+        super()
+        @metadata.id = "nyc-taxi-and-limousine-commission-yellow-taxi-trip"
+        @metadata.name = "New York City Taxi and Limousine Commission: yellow taxi trip record dataset"
+        @metadata.url = "https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page"
+        @metadata.licenses = [
+          {
+            name: "NYC Open Data Terms of Use",
+            url: "https://opendata.cityofnewyork.us/overview/#termsofuse",
+          }
+        ]
+        @year = year
+        @month = month
+      end
+
+      def each
+        return to_enum(__method__) unless block_given?
+
+        open_data.raw_records.each do |raw_record|
+          record = Record.new(*raw_record)
+          yield(record)
+        end
+      end
+
+      private
+      def open_data
+        base_name = "yellow_tripdata_%04d-%02d.parquet" % [@year, @month]
+        data_path = cache_dir_path + base_name
+        data_url = "https://d37ci6vzurychx.cloudfront.net/trip-data/#{base_name}"
+        download(data_path, data_url)
+        Arrow::Table.load(data_path)
+      end
+    end
+  end
+end
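
`Record` above decodes raw column values by overriding the Struct setters and routing every constructor argument through them. A minimal, self-contained sketch of that technique, using a hypothetical two-field `MiniRecord` rather than the real 19-field class:

```ruby
# Hypothetical two-field stand-in for YellowTaxiTrip::Record.
class MiniRecord < Struct.new(:vendor, :store_and_fwd)
  # Predicate-style reader for the boolean field.
  alias_method :store_and_fwd?, :store_and_fwd

  def initialize(*values)
    # Start with every member set to nil, then assign through the
    # overridden setters so that each raw value is decoded.
    super()
    members.zip(values) do |member, value|
      __send__("#{member}=", value)
    end
  end

  # Decode the numeric vendor ID into a symbol.
  def vendor=(vendor)
    super(vendor == 1 ? :creative_mobile_technologies : :veri_fone_inc)
  end

  # Decode the "Y"/"N" flag into a boolean.
  def store_and_fwd=(flag)
    super(flag == "Y")
  end
end

record = MiniRecord.new(1, "N")
p record.vendor         # prints :creative_mobile_technologies
p record.store_and_fwd? # prints false
```

Calling `super()` with no arguments and then sending each setter is what makes the decoding apply at construction time; passing the values straight to Struct's own `initialize` would store them raw.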
