Shrink GTFS Bundles #176

@ankoure

Description

My understanding is that PR #104 is blocked because the GTFS bundles are too large and there is not enough space for a new build. If we converted the GTFS bundles to Parquet, there would be a potential 10x reduction in size, based on this quick-and-dirty experiment script.

I'm not totally sure how the GTFS bundles are being sourced, so determining where to place the conversion step is a key detail I'm missing.

```python
# csv_to_parquet.py
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

gtfs_path = "MBTA_GTFS"
parquet_path = "./GTFS_PARQUET"

# Create the output directory if it doesn't exist
os.makedirs(parquet_path, exist_ok=True)

for csv_file in os.listdir(gtfs_path):
    if not csv_file.endswith(".txt"):
        continue  # skip non-GTFS files
    print(csv_file)
    raw_name = csv_file.replace(".txt", "")
    parquet_file = os.path.join(parquet_path, f"{raw_name}.parquet")
    chunksize = 100_000

    csv_stream = pd.read_csv(
        os.path.join(gtfs_path, csv_file),
        chunksize=chunksize,
        low_memory=False,
        dtype=str,  # Read all columns as strings to avoid type conflicts
    )

    parquet_writer = None
    parquet_schema = None

    for i, chunk in enumerate(csv_stream):
        if i == 0:
            # Take the column names from the first chunk, then build a schema
            # with all string types, explicitly allowing nulls
            pandas_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_schema = pa.schema(
                [
                    pa.field(field.name, pa.string(), nullable=True)
                    for field in pandas_schema
                ]
            )
            # Open a Parquet file for writing
            parquet_writer = pq.ParquetWriter(
                parquet_file, parquet_schema, compression="snappy"
            )
        # Convert the chunk to a table with the explicit string schema and append it
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    if parquet_writer:
        parquet_writer.close()
```

Metadata

Labels

optimization: Optimizing some existing functionality
tech-debt: Addressing technical debt
