fix: reduce memory usage in export_csv function. #1469
Merged
Summary:
Closes #898
Changed the way we obtain the feeds from the DB. Reduced the batch size from 500 to 50. Since this will likely take longer to execute, increased the timeout from 10 minutes to 1 hour (though 1 hour may be overkill).
Modified how the joins are performed when querying the DB so that queries use less memory.
Note: This PR also affects the API calls. All tests pass, but extra attention should be given to possible side effects.
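A minimal sketch of the batched fetch loop described above (paging through results 50 rows at a time instead of 500). The `fetch_page(offset, limit)` callable is hypothetical and stands in for the real DB query; it is not part of this PR:

```python
from typing import Callable, Iterator, List, TypeVar

T = TypeVar("T")

def iter_in_batches(fetch_page: Callable[[int, int], List[T]],
                    batch_size: int = 50) -> Iterator[List[T]]:
    """Yield pages of results until the data source is exhausted.

    fetch_page(offset, limit) is a hypothetical stand-in for the
    real DB query; batch_size=50 mirrors the value this PR uses.
    """
    offset = 0
    while True:
        page = fetch_page(offset, batch_size)
        if not page:
            return  # empty page means we've consumed everything
        yield page
        offset += len(page)

# Usage with a fake in-memory data source standing in for the DB:
rows = list(range(120))
pages = list(iter_in_batches(lambda off, lim: rows[off:off + lim]))
```

With 120 rows and a batch size of 50 this yields pages of 50, 50, and 20 rows, keeping only one page in memory at a time.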
AI summary
This pull request makes significant improvements to how related data is loaded and streamed from the database, with a focus on memory efficiency and correctness for feed export and query operations. The main changes involve switching from `joinedload` to `selectinload` for loading collections, improving batch processing and streaming, and ensuring active feeds are prioritized in exports. These changes help prevent memory issues, reduce row multiplication, and make exported data more accurate.

Database Query and Loading Strategy Improvements:

- Switched from `joinedload` to `selectinload` for loading collections in queries (e.g., `Gtfsfeed.latest_dataset`, `Gtfsdataset.validation_reports`, `Validationreport.features`, and related entities). This change prevents row multiplication and excessive memory usage when streaming large result sets.
- Updated `get_all_gtfs_feeds` and `get_all_gtfs_rt_feeds` to use smaller batch sizes, stream results with `stream_results=True`, and clear the SQLAlchemy session cache between batches to avoid memory leaks. Added logging for batch progress.
- Uses `with_loader_criteria` for correctness.

Export Functionality and Data Accuracy:

- Added a `track_metrics` decorator to the main CSV export function to record time, memory, and CPU usage during export operations, supporting better performance monitoring.

General Codebase Cleanups:
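The `track_metrics` decorator itself is project-specific; a minimal stdlib-only sketch of the same idea (recording wall time, CPU time, and peak memory around a call) could look like this. The `export_csv` function below is a toy stand-in, not the real export code:

```python
import functools
import logging
import time
import tracemalloc

logging.basicConfig(level=logging.INFO)

def track_metrics(func):
    """Hypothetical sketch: log wall time, CPU time, and peak memory."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        wall_start = time.perf_counter()
        cpu_start = time.process_time()
        try:
            return func(*args, **kwargs)
        finally:
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            logging.info(
                "%s: wall=%.3fs cpu=%.3fs peak_mem=%.1fKiB",
                func.__name__,
                time.perf_counter() - wall_start,
                time.process_time() - cpu_start,
                peak / 1024,
            )
    return wrapper

@track_metrics
def export_csv(rows):
    # Toy stand-in for the real export: join each row into a CSV line.
    return "\n".join(",".join(map(str, r)) for r in rows)

csv_text = export_csv([(1, "a"), (2, "b")])
```

Using `finally` ensures the metrics are logged even if the export raises.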
These changes collectively make feed queries and exports more robust, memory-efficient, and accurate.
Please make sure these boxes are checked before submitting your pull request - thanks!
- [ ] Run `./scripts/api-tests.sh` to make sure you didn't break anything