Open
Labels
optimization (Optimizing some existing functionality), tech-debt (Addressing technical debt)
Description
My understanding is that PR #104 is blocked because the GTFS bundles are too large, which leaves too little space for a new build. Converting the GTFS bundles to Parquet could potentially cut their size by about 10x, based on the quick-and-dirty experiment script below.
I'm not sure how the GTFS bundles are sourced, so the key detail I'm missing is where the conversion step should go.
# csv_to_parquet.py
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

gtfs_path = "MBTA_GTFS"
parquet_path = "./GTFS_PARQUET"

# Create output directory if it doesn't exist
os.makedirs(parquet_path, exist_ok=True)

for csv_file in os.listdir(gtfs_path):
    print(csv_file)
    raw_name = csv_file.replace(".txt", "")
    parquet_file = os.path.join(parquet_path, f"{raw_name}.parquet")
    chunksize = 100_000

    # Stream the CSV in chunks so large files never sit fully in memory
    csv_stream = pd.read_csv(
        os.path.join(gtfs_path, csv_file),
        chunksize=chunksize,
        low_memory=False,
        dtype=str,  # Read all columns as strings to avoid type conflicts
    )

    parquet_writer = None
    parquet_schema = None
    for i, chunk in enumerate(csv_stream):
        if i == 0:
            # Create schema with all string types, explicitly allowing nulls
            pandas_schema = pa.Table.from_pandas(df=chunk).schema
            # Convert all types to string with null support
            parquet_schema = pa.schema(
                [
                    pa.field(field.name, pa.string(), nullable=True)
                    for field in pandas_schema
                ]
            )
            # Open a Parquet file for writing
            parquet_writer = pq.ParquetWriter(
                parquet_file, parquet_schema, compression="snappy"
            )
        # Write CSV chunk to the parquet file
        if parquet_writer:
            # Convert chunk to table with explicit string schema
            table = pa.Table.from_pandas(chunk, schema=parquet_schema)
            parquet_writer.write_table(table)

    # The writer is still None if the CSV produced no chunks
    if parquet_writer:
        parquet_writer.close()
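To check the size-reduction estimate on a given bundle, something like the following could compare the on-disk size of the source directory with the converted one. This is only a rough sketch: `dir_size_bytes` and `size_ratio` are hypothetical helper names (not part of the script above), the two directories are passed in as arguments rather than assumed, and only files directly inside each directory are counted.

```python
# compare_sizes.py (hypothetical helper, not part of the original script)
import os
import sys


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of regular files directly inside `path` (non-recursive)."""
    total = 0
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file():
                total += entry.stat().st_size
    return total


def size_ratio(csv_dir: str, parquet_dir: str) -> float:
    """Return how many times larger `csv_dir` is than `parquet_dir`."""
    parquet_size = dir_size_bytes(parquet_dir)
    if parquet_size == 0:
        raise ValueError(f"no file data found in {parquet_dir!r}")
    return dir_size_bytes(csv_dir) / parquet_size


if __name__ == "__main__":
    # Expects two arguments: the GTFS text directory and the Parquet directory
    if len(sys.argv) == 3:
        csv_dir, parquet_dir = sys.argv[1], sys.argv[2]
        print(f"{csv_dir}: {dir_size_bytes(csv_dir):,} bytes")
        print(f"{parquet_dir}: {dir_size_bytes(parquet_dir):,} bytes")
        print(f"reduction: {size_ratio(csv_dir, parquet_dir):.1f}x")
    else:
        print("usage: python compare_sizes.py <gtfs_dir> <parquet_dir>")
```

Run it with the GTFS directory and the output directory of the conversion as the two arguments. The ratio will vary by feed and by compression codec, so the 10x figure should be treated as an estimate until it is measured on the actual bundles.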