Background
The CSV download endpoint has long been a source of instability in the application. The database now holds an estimated 913 million timeseries result records. To manage those, we employ a chunking strategy in which data are grouped into 30-day blocks by datetime, then further grouped by result ID before being compressed. This is managed through TimescaleDB.
This chunking scheme is well suited to the access patterns of the frontend application: narrowly specified datetime ranges spread across multiple requests. The CSV download endpoint is different: requests that span a wide date range require retrieving and decompressing large amounts of data to compile into a single CSV file. For some stations, a CSV request spanning the full date range can require decompressing nearly all of the chunked data records to extract the results of interest. Doing that decompression in a single request locks up the application (or at least the database) and prevents it from serving other traffic.
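As a rough back-of-the-envelope illustration (a hypothetical helper, assuming only the 30-day chunk interval described above), the number of chunks a single request must decompress grows linearly with the requested span:

```python
from datetime import datetime, timedelta
from math import ceil

# 30-day TimescaleDB chunk interval described above
CHUNK_INTERVAL = timedelta(days=30)

def chunks_spanned(start: datetime, end: datetime) -> int:
    """Rough upper bound on how many 30-day chunks a [start, end] query can touch."""
    if end <= start:
        return 0
    # A span of N days can overlap at most ceil(N / 30) + 1 chunks,
    # depending on how the range aligns with chunk boundaries.
    return ceil((end - start) / CHUNK_INTERVAL) + 1
```

A full-history request over roughly a decade of data would touch well over a hundred chunks, each of which must be decompressed before the results can be filtered.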
We should consider mitigation strategies for the database table locks. We should also consider restricting the datetime span of a single CSV download, so that fetching the full dataset requires multiple requests, or implementing pagination for longer timespans.
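A span restriction could be enforced with a simple validator at the endpoint. This is only a sketch; the 90-day cutoff is a placeholder, and the real value should come from the benchmarking called for in the closure criteria:

```python
from datetime import datetime, timedelta

# Placeholder cutoff; the actual limit needs to be determined empirically.
MAX_SPAN = timedelta(days=90)

def validate_span(start: datetime, end: datetime) -> None:
    """Reject CSV requests whose datetime span exceeds the cutoff."""
    if end <= start:
        raise ValueError("end must be after start")
    if end - start > MAX_SPAN:
        raise ValueError(
            f"requested span of {(end - start).days} days exceeds the "
            f"{MAX_SPAN.days}-day limit; split the download into multiple requests"
        )
```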
Closure Criteria
- Better understand and document how a CSV request acquires locks on database tables.
- Evaluate different datetime spans to determine a reasonable performance/span cutoff.
- Modify the CSV endpoint and frontend to limit the time span to that cutoff, i.e. require multiple requests to fetch the entire dataset.
- Alternatively, implement pagination (e.g. with Django's Paginator class).
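If the multiple-request route is taken, the frontend (or a paginated endpoint) would need to split a long range into bounded windows. A minimal sketch, assuming a hypothetical 90-day per-request window:

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

def date_windows(
    start: datetime,
    end: datetime,
    window: timedelta = timedelta(days=90),  # hypothetical per-request limit
) -> Iterator[Tuple[datetime, datetime]]:
    """Yield consecutive (window_start, window_end) pairs covering [start, end)."""
    cur = start
    while cur < end:
        nxt = min(cur + window, end)
        yield cur, nxt
        cur = nxt
```

Each yielded pair becomes one CSV request, keeping every individual query within the chunk budget the cutoff is meant to enforce.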