Shrink GTFS Bundles #176

@ankoure

Description

My understanding is that PR #104 is blocked because the GTFS bundles are too large and there is not enough space for a new build. If we converted the GTFS bundles to Parquet, there would be a potential 10x reduction in size, based on this quick-and-dirty experiment script.

I'm not totally sure how the GTFS bundles are being sourced, so determining where to place the conversion step is a key detail I'm missing.

```python
# csv_to_parquet.py
import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

gtfs_path = "MBTA_GTFS"
parquet_path = "./GTFS_PARQUET"

# Create the output directory if it doesn't exist
os.makedirs(parquet_path, exist_ok=True)

for csv_file in os.listdir(gtfs_path):
    if not csv_file.endswith(".txt"):
        continue  # skip non-GTFS files
    print(csv_file)
    raw_name = csv_file.replace(".txt", "")
    parquet_file = os.path.join(parquet_path, f"{raw_name}.parquet")
    chunksize = 100_000

    csv_stream = pd.read_csv(
        os.path.join(gtfs_path, csv_file),
        chunksize=chunksize,
        low_memory=False,
        dtype=str,  # Read all columns as strings to avoid type conflicts
    )

    parquet_writer = None
    parquet_schema = None

    for i, chunk in enumerate(csv_stream):
        if i == 0:
            # Take the column names from the first chunk, then build a schema
            # with all string types, explicitly allowing nulls
            pandas_schema = pa.Table.from_pandas(df=chunk).schema
            parquet_schema = pa.schema(
                [
                    pa.field(field.name, pa.string(), nullable=True)
                    for field in pandas_schema
                ]
            )
            # Open a Parquet file for writing
            parquet_writer = pq.ParquetWriter(
                parquet_file, parquet_schema, compression="snappy"
            )
        # Convert the chunk to a table with the explicit string schema and append it
        table = pa.Table.from_pandas(chunk, schema=parquet_schema)
        parquet_writer.write_table(table)

    if parquet_writer:
        parquet_writer.close()
```

Metadata

Labels

optimization: Optimizing some existing functionality
tech-debt: Addressing technical debt
