Description
Currently, the summary endpoint (e.g. https://proenergia-staging.ds.io/api/v1/scenario/1/summary/Pop/) is extremely slow.
This endpoint is meant to give summary statistics for a particular key in the scenario data for a scenario.
The view handling querying and aggregating summary data is here: https://github.com/developmentseed/moz-proenergia-backend/blob/main/proenergia/datasets/views.py#L92
Ensuring we know the root cause
Aggregating the summary values across many entries is likely going to be a bit slow in the database regardless, but we should confirm:
- What are all the SQL queries being generated for this request, and how long is each taking?
There could be some quick wins here - e.g. I'm a bit worried about this line: https://github.com/developmentseed/moz-proenergia-backend/blob/main/proenergia/datasets/views.py#L97 which might be triggering a giant query that we could get rid of or do differently.
So as a first pass: log all SQL queries generated and the time each takes. The only thing that should take time is the actual query generating the summary statistics.
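A minimal way to get that (a sketch, assuming a standard Django setup on staging/local) is to enable the django.db.backends logger, which prints every SQL statement together with its execution time when DEBUG is True:

```python
# settings.py (local/staging only): print every SQL statement and its duration.
# Django only records query timings when DEBUG is True.
DEBUG = True

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {"console": {"class": "logging.StreamHandler"}},
    "loggers": {
        # Logs each query as e.g. "(0.532) SELECT ... ; args=..."
        "django.db.backends": {"handlers": ["console"], "level": "DEBUG"},
    },
}
```

django.test.utils.CaptureQueriesContext (or django-debug-toolbar) is an alternative if we want to capture the queries for a single call in a shell session instead.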
Possible Solutions
I think solutions here would broadly fall into three buckets:
- Optimize the SQL, indexes and querying
- Implement some sort of caching
- Throw more hardware at the problem
We should likely look at these in order.
Optimizing SQL and Indexes
There are a few ideas for how we could speed up the queries:
- Expensive queries to check `.exists()`: The two queries checking that the ScenarioData and the key exist seem like they could be unnecessarily expensive. We should get metrics, and if it's those queries slowing things down, we can definitely do something a lot more lightweight to check `exists()` rather than the potentially expensive queries being generated there.
- Improve indexing: There are possibly small tweaks we can make to the db indexes - one I see is a single-column index on `scenario`. There are possibly other tweaks to the GIN index that we could look into / perhaps ask for Bitner's help.
- Improve aggregation-by-type logic: It may be more efficient to get the "type" of a key first and then attempt the aggregation based on that type instead of the `try/except` here: https://github.com/developmentseed/moz-proenergia-backend/blob/main/proenergia/datasets/views.py#L114 - but again, we'd need to measure this to find out, and this is likely not actually an issue.
We should look for quick wins on the SQL optimizing, but I wouldn't get into a rabbit hole of index optimization - if it turns out that a lot of time is actually being taken by, for example, ScenarioData.objects.filter(scenario=scenario) and not the aggregate query, then that's something easy to fix.
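As a rough illustration of the first two ideas above (the real model and view code may differ; the scenario foreign key and a JSON column called data are assumptions inferred from this issue, not copied from the repo):

```python
# Sketch only: field names are assumptions based on the issue description.
from django.db import models


class ScenarioData(models.Model):
    scenario = models.ForeignKey("datasets.Scenario", on_delete=models.CASCADE)
    data = models.JSONField()

    class Meta:
        indexes = [
            # Plain b-tree index so filter(scenario=...) doesn't need a
            # sequential scan; complements the existing GIN index on `data`.
            models.Index(fields=["scenario"], name="scenariodata_scenario_idx"),
        ]


def key_is_present(scenario, key):
    # Both checks compile to cheap `SELECT 1 ... LIMIT 1` queries; no rows
    # are materialized. `data__has_key` can use the GIN index on Postgres
    # (depending on its opclass).
    qs = ScenarioData.objects.filter(scenario=scenario)
    return qs.exists() and qs.filter(data__has_key=key).exists()
```

Whether the single-column index is actually needed depends on how scenario is declared today - a ForeignKey already gets a b-tree index unless db_index=False was set - which is another reason to look at the query plans before changing anything.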
If it turns out we simply have too many entries in ScenarioData and the aggregation query is always going to be slow, then I would strongly advocate for a caching strategy.
Caching Strategy
When looking at this, my first instinct was that the easiest thing to do would be to create a separate table with precomputed summary values, and handle updating summary values on updates to ScenarioData.
So:
- A new model like ScenarioDataSummary or so with scenario_id, key, type, count, min, max, etc.
- A management command to update summaries for a scenario
- A Celery task or so to trigger updating summaries when scenario data changes
This would likely not be a good idea if scenario data is constantly being updated, but if it is not updated much, this seems like the nicest solution to me.
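A rough sketch of what that could look like (model fields, task name, import paths, and the numeric-only aggregation are placeholders, not a final design):

```python
# Sketch only: names and paths are assumptions, not the actual project code.
from celery import shared_task
from django.db import models
from django.db.models import Count, FloatField, Max, Min
from django.db.models.fields.json import KeyTextTransform
from django.db.models.functions import Cast

from proenergia.datasets.models import ScenarioData  # existing model (assumed path)


class ScenarioDataSummary(models.Model):
    # One precomputed row per (scenario, key), refreshed when data changes.
    scenario = models.ForeignKey("datasets.Scenario", on_delete=models.CASCADE)
    key = models.CharField(max_length=255)
    type = models.CharField(max_length=32)  # e.g. "numeric" / "categorical"
    count = models.IntegerField(default=0)
    min = models.FloatField(null=True)
    max = models.FloatField(null=True)

    class Meta:
        constraints = [
            models.UniqueConstraint(
                fields=["scenario", "key"], name="uniq_summary_per_scenario_key"
            ),
        ]


@shared_task
def refresh_scenario_summary(scenario_id, key):
    # Cast the JSON value for `key` to a float and aggregate in a single query.
    # Assumes the key holds numeric values; the real task would branch on type.
    values = ScenarioData.objects.filter(scenario_id=scenario_id).annotate(
        value=Cast(KeyTextTransform(key, "data"), FloatField())
    )
    stats = values.aggregate(count=Count("value"), min=Min("value"), max=Max("value"))
    ScenarioDataSummary.objects.update_or_create(
        scenario_id=scenario_id,
        key=key,
        defaults={"type": "numeric", **stats},
    )
```

The management command would then just loop over scenarios and keys and call the same function synchronously, and the summary view becomes a single-row lookup on ScenarioDataSummary.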
Increasing database hardware
@willemarcel let's get a good handle on the queries being generated and be sure of what's causing the slowness. We can definitely throw more hardware at the database, but I want to make sure we first do our due diligence on optimizing and treat more hardware as a last resort.
My hunch is that even if we fix some issues in the queries being generated and the indexes, doing these aggregations over ~1 million values in the database is going to be a bit slow (even if we bring it down to 4-5s, that's still "slow"). I think I would very much advocate for precomputing these summary values into a separate table. @willemarcel do you see a big problem with that approach?