@jcpitre (Collaborator) commented on Nov 18, 2025

Summary:
Closes #898
Changed the way we obtain the feeds from the DB and reduced the batch size from 500 to 50. Since this is likely to take more time to execute, increased the timeout from 10 minutes to 1 hour (though 1 hour may be overkill).
Modified the way the joins are done while querying the DB so the queries use less memory.

Note: This PR also affects the API calls. All tests pass, but extra attention should be given to possible side effects.

AI summary

This pull request makes significant improvements to how related data is loaded and streamed from the database, with a focus on memory efficiency and correctness for feed export and query operations. The main changes involve switching from joinedload to selectinload for loading collections, improving batch processing and streaming, and ensuring active feeds are prioritized in exports. These changes help prevent memory issues, reduce row multiplication, and make exported data more accurate.

Database Query and Loading Strategy Improvements:

  • Switched from joinedload to selectinload for loading collections in queries (e.g., Gtfsfeed.latest_dataset, Gtfsdataset.validation_reports, Validationreport.features, and related entities). This change prevents row multiplication and excessive memory usage when streaming large result sets; a sketch of this loading pattern appears right after this list.
  • Updated the batch processing logic in get_all_gtfs_feeds and get_all_gtfs_rt_feeds to use smaller batch sizes, stream results with stream_results=True, and clear the SQLAlchemy session cache between batches to avoid memory leaks. Added logging for batch progress; see the second sketch after the list.
  • Improved filtering and loader options to ensure only "active" feeds are included in certain relationships, using with_loader_criteria for correctness.
  • Enhanced error handling and static analysis compatibility in bounding box filtering logic for clarity and reliability.
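
As referenced in the first bullet above, the loader change amounts to the following pattern. This is a minimal sketch, not the PR's actual code: the entity and relationship names (Gtfsfeed.latest_dataset, Gtfsdataset.validation_reports, Validationreport.features) come from the summary, while the `status` column and the overall query shape are assumptions for illustration.

```python
# Hedged sketch of the selectinload strategy described above.
from sqlalchemy import select
from sqlalchemy.orm import selectinload, with_loader_criteria


def build_feed_query():
    """Load feed collections with per-collection SELECT ... IN queries.

    selectinload avoids the row multiplication a joinedload would cause when a
    feed has many datasets/reports/features, which matters when results are
    streamed rather than fully buffered.
    """
    return select(Gtfsfeed).options(
        selectinload(Gtfsfeed.latest_dataset)
        .selectinload(Gtfsdataset.validation_reports)
        .selectinload(Validationreport.features),
        # Restrict Gtfsfeed rows to "active" ones wherever the entity appears
        # in this statement, including relationship loads.
        # The `status` column name is an assumption.
        with_loader_criteria(Gtfsfeed, Gtfsfeed.status == "active"),
    )
```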
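As referenced in the second bullet, the batching and streaming behavior can be sketched as follows, assuming the batch size of 50 mentioned in the PR description; the helper and variable names here are illustrative, not the actual code in get_all_gtfs_feeds.

```python
# Illustrative sketch of streaming feeds in small batches (names are assumptions).
import logging

logger = logging.getLogger(__name__)
BATCH_SIZE = 50  # reduced from 500 in this PR


def stream_feeds_in_batches(session, query):
    """Yield lists of at most BATCH_SIZE feeds without buffering the full result."""
    result = session.execute(
        query.execution_options(stream_results=True, yield_per=BATCH_SIZE)
    )
    for batch_number, partition in enumerate(result.scalars().partitions(BATCH_SIZE), 1):
        batch = list(partition)
        logger.info("Processing batch %d (%d feeds)", batch_number, len(batch))
        yield batch
        # Drop ORM instances from the identity map so memory stays bounded
        # across batches.
        session.expunge_all()
```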

Export Functionality and Data Accuracy:

  • In the CSV export logic, changed the sorting of GTFS feed references for realtime feeds to prioritize "active" feeds first, using a stable sort, and updated test data to verify this behavior; a sketch of the sort appears after this list.
  • Added a track_metrics decorator to the main CSV export function to record time, memory, and CPU usage during export operations, supporting better performance monitoring; a sketch of such a decorator also follows the list.
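
The "active feeds first" ordering in the first bullet relies on Python's sort being stable, roughly as below; the `gtfs_feeds` collection and the `status` attribute are assumptions, not confirmed names from the diff.

```python
# Hypothetical sketch: put "active" GTFS feed references first while keeping
# the original relative order (Python's sort is stable).
feed_references = sorted(
    rt_feed.gtfs_feeds,
    key=lambda feed: feed.status != "active",  # False (0) sorts before True (1)
)
```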
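The track_metrics decorator mentioned in the second bullet could be approximated as below. This is a generic sketch using the standard library plus psutil, not the project's actual implementation.

```python
# Generic sketch of a metrics-tracking decorator (not the project's implementation).
import functools
import logging
import time
import tracemalloc

import psutil  # assumed available; used for CPU time

logger = logging.getLogger(__name__)


def track_metrics(func):
    """Log wall time, peak memory, and CPU time consumed by the wrapped call."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        process = psutil.Process()
        cpu_before = process.cpu_times()
        tracemalloc.start()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            _, peak_bytes = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            cpu_after = process.cpu_times()
            cpu_seconds = (cpu_after.user - cpu_before.user) + (
                cpu_after.system - cpu_before.system
            )
            logger.info(
                "%s: %.2fs wall, %.1f MiB peak, %.2fs CPU",
                func.__name__, elapsed, peak_bytes / 2**20, cpu_seconds,
            )

    return wrapper
```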

General Codebase Cleanups:

  • Improved type hints and removed unused imports for better maintainability and static analysis.
  • Added more explicit comments and docstring clarifications throughout the code to explain the rationale for loader choices and batch handling.

These changes collectively make feed queries and exports more robust, memory-efficient, and accurate.

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". The title must follow the Conventional Commits specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@jcpitre changed the title from "Modified to reduce memory usage in export_csv function." to "fix: reduce memory usage in export_csv function." on Nov 18, 2025
@davidgamez (Member) left a comment

LGTM!

@davidgamez merged commit 34ac9d3 into main on Nov 19, 2025
2 of 3 checks passed
@davidgamez deleted the Export-CSV-function-exceeds-memory-limit-898 branch on November 19, 2025