Commit 72e1ebb

Merge pull request #2345 from PolicyEngine/2344_publish_api_architecture
API Architecture doc blog post
2 parents 0ef8614 + 5a79302 commit 72e1ebb

File tree

5 files changed

+243
-0
lines changed

134 KB

src/images/posts/api-v2.webp

239 KB

src/posts/articles/api-v2.md

Lines changed: 226 additions & 0 deletions
@@ -0,0 +1,226 @@
1+
PolicyEngine provides comprehensive microsimulation tools through multiple platforms: our web interface, Python packages, and API infrastructure. The API enables both our website and external applications such as [MyFriendBen](https://www.myfriendben.org/) and [Benefit Navigator](https://www.imaginela.org/benefit-navigator) to access PolicyEngine’s computational engine for precise benefit calculations and policy analysis.
2+
3+
To support our expanding user base and maintain robust performance, we have documented our architectural evolution and future development priorities. This technical specification will serve as a resource for current API users and potential contributors, detailing our infrastructure design decisions and planned enhancements.
4+
5+
## Overview
6+
7+
This document gives a brief overview of the current architecture, proposes a new target architecture, and outlines high-level incremental steps to transition from one to the other. It is offered for review against the following questions:
8+
9+
- Are we aligned on the problem statement?
10+
- Do we agree the target architecture addresses the problem statement?
11+
- Does our incremental transition align with our priorities?
12+
13+
We expect the final architecture to:
14+
15+
- Reduce monthly Google Cloud costs by about 90% (roughly $6,500 USD) by reducing the amount and type of compute kept continuously “hot” and allocating more expensive hardware only as needed.
16+
- Substantially reduce the code footprint and generally improve the maintainability of the API services we own.
17+
- Substantially improve the observability of the services we run, providing detailed data for debugging and operations.
18+
19+
We do this by adjusting our API hosting environment and implementation frameworks.
20+
21+
For our hosting environment, we propose continuing to primarily use Google Cloud Platform for hosting our services, but leveraging Cloud Run, workflows, and metrics/logs/trace services to provide better cost, scalability, reliability and observability.
22+
23+
For our API stack, we plan to combine our APIs (household and general) into a single, consistent API. To implement it, we propose switching from Flask with SQLAlchemy to FastAPI and SQLModel, with OpenTelemetry for trace/metric/log generation. We expect these changes to substantially reduce the code footprint and maintenance burden of our API code base.
24+
25+
As migrating from our current architecture to this improved version is non-trivial, we lay out a broad, incremental plan for transitioning from the current to the target infrastructure below. This plan focuses on:
26+
27+
1. Reducing the standing cost of our existing application API
28+
1. Improving the reliability and performance of our Household API
29+
1. Generalizing the Household API into a single PolicyEngine API capable of supporting both the app and external client use cases.
30+
31+
Following review of this document, we plan to create a more detailed roadmap and implementation plan focusing the most detail on the near future and continuing to develop our plan as we implement.
32+
33+
## Scope
34+
35+
**In scope**
36+
37+
- Target architecture
38+
- Household and application API implementation
39+
- Associated services such as Auth0, AppEngine, etc.
40+
- Implementation stack (i.e. Flask/APIFlask/FastAPI/etc.)
41+
- Observability
42+
- High-level path to incrementally replacing our current API
43+
- Are we making the right trade-offs/priority decisions?
44+
- What additional discussions/designs/investigations are required to actually implement our next steps?
45+
46+
**Out of scope**
47+
48+
- SPA application — this is currently being redesigned separately and should not block our API work.
49+
- Testability/Deployment — Requires its own design
50+
- Security — we need to do this, but as a separate document.
51+
- The actual simulation code (policyengine.py, population data, etc) and associated stability/replicability/deployment/etc. issues — this is being covered separately in the linked reference below.
52+
- Comparison of alternative stacks/hosting platforms — already done and included in references.
53+
- Detailed tasking/scheduling — This document will be an input to that process.
54+
55+
## References
56+
57+
- [API Stack/hosting evaluation 2025](https://docs.google.com/document/d/1HdH59-8JWihJI7Apsx_tXYkL6yVZu6T8lY4OIWIIkdA/edit?tab=t.0#heading=h.vk2hnxs7lmu5) — details of what other hosting platforms/stack configurations we considered before landing on this target architecture.
58+
- [policyengine.py package](https://docs.google.com/document/d/1YnevUaEarAl5-25veFlJP5amZHlraSSTbmAfoUTX3_o/edit?pli=1&tab=t.0) — moving all the “business logic” of running a simulation into a single package outside of the API.
59+
- [FastAPI Demo](https://github.com/mikesmit/fastapi-demo) — repository demonstrating FastAPI with input/output and database model validation, observability, and modularity.
60+
61+
## Current Architecture
62+
63+
### Flow Diagram (Simplified)
64+
65+
![](https://cdn-images-1.medium.com/max/2000/0*qJfIPrHyVyOszV7U)
66+
67+
- **PolicyEngine App** — React Single Page Application (SPA) providing a UI for running simulations
68+
- **External Clients** — MyFriendBen, Benefit Navigator, and other external, paying customers of our API who build user experiences on top of it
69+
- **API** — Flask-based API used to support the PolicyEngine app.
70+
- Runs on a single host because the service instance is stateful
71+
- Uses a local Redis service to queue simulations
72+
- Uses a local worker process to run simulations
73+
- Uses a local SQLite database for storage of some data
74+
- **Household API** — a completely separate API, also implemented in Flask and also running in App Engine, that only does household simulations
75+
- Used by external clients exclusively
76+
- **Database** — Cloud SQL managed database for storing policies, households, user data, simulation results, etc.
77+
- **Auth0** — external OAuth 2.0 provider used to authenticate external users
78+
79+
### Limitations
80+
81+
- **Scaling** — The current API scales by running more worker processes on a single, beefier container.
82+
- It is not currently possible to scale by adding more hosts because the API retains local state (SQLite database & redis task queue) which is not replicated across hosts.
83+
- **Stability** — Because all components run on one host:
84+
- Simulation workers can interfere with each other and bring down the API as a whole.
85+
- Bringing down one host brings down the whole service and loses state.
86+
- **Cost** — The single container has to be sized to support multiple workflows and stay up 100% of the time, even though most of the time it is not running any. This costs on the order of $10K a month.
87+
- **Observability** — The various components running in App Engine do not generate trace or metric information and provide limited logging.
88+
- **Billing** — There is currently no automated mechanism to capture and bill for usage of the commercial API.
89+
- **Maintenance**— Generally the system is hard to maintain.
90+
- Two completely separate services with separate code bases to do variations of the same thing
91+
- Use of Flask without any schema-based validation of inputs/outputs, Object Relational Mapper (ORM), standard traceability/logging, standard auth integration, etc. — more errors by default, more expensive to fix, harder to debug.
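
To make the last limitation concrete, here is a minimal standard-library sketch of what schema-based validation at the API boundary provides. The `HouseholdRequest` shape, its field names, and the accepted country codes are hypothetical and are not the actual API payload; the target stack would use Pydantic models rather than a hand-written dataclass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HouseholdRequest:
    """Hypothetical request schema; field names are illustrative only."""

    country_id: str
    year: int
    people: int

    def __post_init__(self) -> None:
        # Reject malformed input at the boundary instead of letting it
        # fail somewhere deep inside the simulation code.
        if self.country_id not in {"us", "uk"}:
            raise ValueError(f"unsupported country_id: {self.country_id!r}")
        if not isinstance(self.year, int) or self.year < 2000:
            raise ValueError(f"invalid year: {self.year!r}")
        if not isinstance(self.people, int) or self.people < 1:
            raise ValueError("a household needs at least one person")


def parse_request(payload: dict) -> HouseholdRequest:
    """Turn an untrusted JSON-like dict into a validated request."""
    try:
        return HouseholdRequest(**payload)
    except TypeError as exc:  # missing or unexpected keys
        raise ValueError(f"malformed payload: {exc}") from exc
```

Without a schema of this kind, each handler has to repeat such checks by hand or discover bad input only when a simulation fails.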
92+
93+
## Target Architecture
94+
95+
### Flow Diagram
96+
97+
![](https://cdn-images-1.medium.com/max/2000/0*W6Z9B7uHUKI-LlpV)
98+
99+
NOTE: Components only called out where new relative to current architecture
100+
101+
- **PolicyEngine API** — Instead of two completely separate APIs, one common API code base used to run multiple instances.
102+
- Auto-scaled using Cloud Run service across multiple containers
103+
- Using only the dedicated hardware required to run FastAPI (minimal)
104+
- Integrated with GCP logging/metrics/trace for observability.
105+
- Uses GCP Cloud Workflow to delegate simulation tasks
106+
- **GCP Workflow** — GCP-based orchestrator able to run a sequence of tasks, handle retries/errors, etc.
107+
- Supports running multiple parallel tasks to speed up economic simulations.
108+
- **Simulation API** — FastAPI-based Cloud Run resource (job, service, or function; TBD) used to actually execute simulation tasks on appropriate hardware.
109+
- Set up to auto-scale and use appropriate hardware for appropriate tasks.
110+
- **Stripe** — Automated billing for API usage of paying customers.
111+
- **logging/metrics/trace** — GCP observability integrations, automatically generated by all FastAPI-based services for all operations, SQL statements, and logs.
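
The orchestrator's role described above (run steps, retry failures, fan out parallel tasks) can be sketched in plain Python. This is only an in-process illustration of the control flow, assuming thread-based parallelism and a fixed retry count; it is not the GCP Workflows API, and the function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_attempts: int = 3) -> T:
    """Run one step, retrying on failure as an orchestrator would."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # a real orchestrator would filter error types
            last_error = exc
    raise RuntimeError(f"step failed after {max_attempts} attempts") from last_error


def run_parallel_steps(tasks: List[Callable[[], T]], max_attempts: int = 3) -> List[T]:
    """Fan steps out in parallel and collect results in input order."""
    with ThreadPoolExecutor(max_workers=max(1, len(tasks))) as pool:
        futures = [pool.submit(run_with_retries, t, max_attempts) for t in tasks]
        return [f.result() for f in futures]
```

The point of moving this logic into a managed orchestrator is that the API containers no longer hold any retry or queue state themselves.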
112+
113+
### Benefits
114+
115+
- **Scaling** - API is stateless and containers can be added/removed to support traffic
116+
- Compute-heavy tasks run in separate containers and can be separately scaled to support traffic
117+
- **Stability** - Failure of a single container does not cause loss of state
118+
- Failure of a single container does not bring down the whole service
119+
- Retries and error state for simulation runs are managed by the GCP Cloud Workflow orchestrator
120+
- State is maintained entirely outside of any one container, so losing a container causes little or no data loss.
121+
- **Cost**
122+
- Compute tasks can be assigned to an appropriate container and scaled appropriately to demand, reducing the need for and cost of “always on” hosts.
123+
- **Observability** - All FastAPI-based services will be integrated with OpenTelemetry for metrics/logging/trace information providing good default observability for all services (latency, error rate, and log details for all operations and SQL queries by default)
124+
- **Billing** — Addressed via Stripe integration.
125+
- **Maintenance**
126+
- Unified code base — “Household API” vs. “API” is just a difference of configuration using a standard code base of standard API operations and options. All APIs support the same integrated features like database storage and observability.
127+
- FastAPI is used in combination with SQLModel, Auth0 security integration, OpenTelemetry integration, and Pydantic validation of all input/output models to:
128+
- Dramatically reduce errors
129+
- Dramatically reduce code
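
The idea that “Household API” vs. “API” is just a difference of configuration can be sketched as follows. The `ApiConfig` type, the deployment names, and the route names are hypothetical placeholders chosen for illustration; the actual configuration surface has not been designed yet.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ApiConfig:
    """Hypothetical per-deployment settings for one shared code base."""

    name: str
    enabled_routes: FrozenSet[str]


# Route names are placeholders, not the real API surface.
ALL_ROUTES = frozenset({"household", "policy", "economy"})

HOUSEHOLD_API = ApiConfig("household-api", frozenset({"household"}))
POLICYENGINE_API = ApiConfig("policyengine-api", ALL_ROUTES)


def is_route_served(config: ApiConfig, route: str) -> bool:
    """Return True if this deployment exposes the given route."""
    return route in config.enabled_routes
```

Both deployments would share the same handlers, storage, and observability code; only the set of enabled operations differs.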
130+
131+
## Transition Plan
132+
133+
### Phase 1 — Reduce Cost of the main API
134+
135+
We initially tackle the main API by removing the Redis queue/worker setup and replacing it with a GCP workflow executed against a new Simulation API based on FastAPI and policyengine.py.
136+
137+
We then scale down the App Engine instance so that it only runs the Flask API, reducing the cost of the always-on host, and configure the Simulation API to scale up only when used, so that we pay only for the time spent running actual simulation requests.
138+
139+
The main API otherwise remains the same and has many of the same limitations (local state, lack of easy observability, etc.).
140+
141+
Household API is unchanged.
142+
143+
### Flow Diagram
144+
145+
![](https://cdn-images-1.medium.com/max/2000/0*ckFoc6RAUJsqOQ7h)
146+
147+
### Benefits
148+
149+
- Cost reduction
150+
- Demonstrate/vet technologies we propose for the main API
151+
- Demonstrate integration with the Workflow service and GCP logging/metrics/trace
152+
- Demonstrate FastAPI on Cloud Run with scaling
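
The per-operation latency and error data that the logging/metrics/trace integration is expected to provide can be approximated with a small decorator. This is a standard-library stand-in for illustration only: the `traced` decorator and the in-memory `METRICS` store are hypothetical and do not use the OpenTelemetry API.

```python
import time
from collections import defaultdict
from functools import wraps

# In-memory stand-in for a metrics backend: operation name -> samples.
METRICS = defaultdict(list)


def traced(operation):
    """Record duration and success/failure for every call of a function."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            ok = False
            try:
                result = func(*args, **kwargs)
                ok = True
                return result
            finally:
                # Runs on both success and exception, so failures are counted.
                METRICS[operation].append(
                    {"seconds": time.perf_counter() - start, "ok": ok}
                )

        return wrapper

    return decorator


def error_rate(operation):
    """Fraction of recorded calls that raised an exception."""
    samples = METRICS[operation]
    if not samples:
        return 0.0
    return sum(1 for s in samples if not s["ok"]) / len(samples)
```

In the target stack this bookkeeping would come from framework instrumentation rather than hand-written decorators.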
153+
154+
### Phase 2 — Implement Billing and Operations Improvements for Paying Customers
155+
156+
Household API is completely replaced by a new FastAPI-based implementation. This implementation is based on the full target architecture, but only implements the household simulation part of it.
157+
158+
Household simulations are executed using the same simulation API which is based on the new policyengine.py package.
159+
160+
### Flow Diagram
161+
162+
![](https://cdn-images-1.medium.com/max/2000/0*NirH7dEArZmw9CSZ)
163+
164+
### Benefits
165+
166+
- **Automated billing** — using Stripe, we can now automate metering and billing our API customers by usage.
167+
- **Additional Flexibility** — the addition of the workflow and database means the household API can be extended to operate like the main API, with all the same features (this is the next step).
168+
- **Improved observability/operability** — household API now generates traces/logging/metrics which will support robust alarming and debuggability.
169+
- **Demonstrate/vet technologies**
170+
- Allows us to implement a small portion of the full API surface area (just household) using the full target architecture
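
Usage-based billing starts with metering requests per customer. The sketch below shows only that aggregation step, under assumed inputs: the event shape, the free allowance, and the per-request price are hypothetical parameters, and nothing here calls the Stripe API.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def meter_usage(events: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count billable requests per customer.

    Each event is a (customer_id, operation) pair, e.g. taken from request logs.
    """
    usage = Counter(customer for customer, _operation in events)
    return dict(usage)


def invoice_amount_cents(
    request_count: int, free_requests: int, cents_per_request: int
) -> int:
    """Charge only for requests above the free allowance."""
    billable = max(0, request_count - free_requests)
    return billable * cents_per_request
```

A billing integration would then report these per-customer counts to the payment provider on a schedule.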
171+
172+
### Phase 3+
173+
174+
We fully replace the web application with the new design currently being developed and implement it on top of the “Household API”, adding functionality until it becomes the “PolicyEngine API”.
175+
176+
This will involve multiple phases and additional design work to flesh out.
177+
178+
- What data that customers re-use do we need to persist from the existing application/API?
179+
- What links that customers may have saved/referenced do we need to persist from the existing application?
180+
- What schema will we be using in our database to model users/policies/households/etc.?
181+
182+
## Cost Analysis
183+
184+
We estimate we could reduce our monthly GCP compute bill from ~$7,000 to roughly $500, net (based on very conservative assumptions). This should reduce our overall GCP cost by about 90%.
185+
186+
The primary driver of cost now is running App Engine. The primary cost in the new system is running simulations as Cloud Run Functions.
187+
188+
In the target architecture the main driver of cost is running our simulation (policyengine.py) in Cloud Run functions. Costs would rise if:
189+
190+
1. Running the simulation requires more than 16 GB of memory (moving from 16 GB to the next increment, 32 GB, doubles the cost)
191+
1. The number of simulations we run increases substantially
192+
1. The average length of simulations increases substantially.
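
The three factors above can be combined into a rough sensitivity check. This sketch assumes simulation cost scales linearly with memory tier, simulation count, and average duration, taking at face value the statement that moving from 16 GB to 32 GB doubles the cost; it ignores the always-on minimum instance and uses the baseline figures from the usage estimates in this document.

```python
# Baseline taken from the usage estimates in this document.
BASELINE = {"memory_gb": 16, "sims_per_day": 100, "avg_minutes": 15}


def relative_simulation_cost(memory_gb, sims_per_day, avg_minutes):
    """Return simulation cost as a multiple of the baseline estimate.

    Assumes cost is proportional to memory tier x count x duration,
    which is a simplification of real Cloud Run pricing.
    """
    memory_factor = memory_gb / BASELINE["memory_gb"]
    volume_factor = sims_per_day / BASELINE["sims_per_day"]
    duration_factor = avg_minutes / BASELINE["avg_minutes"]
    return memory_factor * volume_factor * duration_factor
```

For example, doubling memory, volume, and duration together would multiply the simulation portion of the bill by eight under these assumptions.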
193+
194+
This analysis assumes substantially more traffic than we currently receive, sustained all day, every day, all month.
195+
196+
### Current Major Drivers of Cost
197+
198+
App Engine is by far the largest driver of PolicyEngine's infrastructure costs, accounting for 90% of the GCP infrastructure cost.
199+
200+
![](https://cdn-images-1.medium.com/max/2000/1*yCLAHlhJYMKXkVIeYVYc0A.png)
201+
202+
### Estimated Compute Cost (Target Architecture)
203+
204+
This estimate assumes both APIs have been reduced to serving API requests only, with all simulation delegated to a workflow.
205+
206+
Cost was estimated using these usage estimates:
207+
208+
- Simulation API
209+
210+
- Cloud Run function using 16 GB and 2 CPUs — based on the existing App Engine configuration.
211+
- One minimum instance is hot at all times — to reduce cold-start time
212+
- 100 simulations a day every day all month averaging 15 minutes to run — WAG
213+
- Only one concurrent request per function at a time — currently safe limitation.
214+
215+
- API compute
216+
217+
- 2 always-active servers running the front-end API and servicing 1M requests a month (roughly 1,000 requests an hour all month, which is a very conservative WAG)
218+
- 2 CPU/1G RAM
219+
- 80 requests concurrent
220+
- 500 ms response
221+
222+
- Workflow (not in the calculator, despite what the documentation says; pricing is at [https://cloud.google.com/workflows/pricing](https://cloud.google.com/workflows/pricing))
223+
- Assuming 2 internal steps per simulation workflow
224+
- Assuming 100 simulations a day every day for a month (100\*30 = 3000)
225+
226+
Total Cost: $538 a month ([calculator here](https://cloud.google.com/products/calculator?hl=en&dl=CjhDaVF4TmpnNFpEWmtNaTAwWTJSaUxUUTRPRGd0T0dSbFlTMHlNR0l6Tm1ZeE1HWTFPV1VRQVE9PRAcGiQ4RTAyNUQwMy02RENFLTQ5RjQtQTUxNi0xNEQ4NDFCNERDNEE))
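
The usage figures behind this estimate follow from simple arithmetic, reproduced here as a sanity check. The inputs are the estimates stated above plus the ~$7,000 current bill; the 30-day month is an assumption, and no unit prices are included because those come from the pricing calculator.

```python
DAYS_PER_MONTH = 30  # assumption used throughout this estimate

# Inputs taken from the usage estimates in this document.
sims_per_day = 100
avg_sim_minutes = 15
workflow_steps_per_sim = 2
api_requests_per_month = 1_000_000
avg_response_seconds = 0.5
current_monthly_usd = 7000
target_monthly_usd = 538

sims_per_month = sims_per_day * DAYS_PER_MONTH
sim_compute_hours = sims_per_month * avg_sim_minutes / 60
workflow_steps = sims_per_month * workflow_steps_per_sim

hours_per_month = DAYS_PER_MONTH * 24
requests_per_hour = api_requests_per_month / hours_per_month

# Little's law: average requests in flight = arrival rate x response time.
avg_in_flight = (
    api_requests_per_month / (hours_per_month * 3600) * avg_response_seconds
)

reduction = (current_monthly_usd - target_monthly_usd) / current_monthly_usd

print(f"simulations/month:  {sims_per_month}")
print(f"simulation hours:   {sim_compute_hours:.0f}")
print(f"workflow steps:     {workflow_steps}")
print(f"API requests/hour:  {requests_per_hour:.0f}")
print(f"avg in-flight reqs: {avg_in_flight:.2f}")
print(f"cost reduction:     {reduction:.1%}")
```

Under these inputs, simulations account for about 750 instance-hours a month, and the API averages well under one request in flight, far below the 80-request concurrency assumed per server.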

src/posts/authors.json

Lines changed: 8 additions & 0 deletions
@@ -125,5 +125,13 @@
125125
"twitter": "twitter.com/jasondebacker",
126126
"headshot": "jason-debacker.jpeg",
127127
"title": "Associate Professor of Economics at the University of South Carolina"
128+
},
129+
"michael-smit": {
130+
"name": "Michael Smit",
131+
"email": "michael@policyengine.org",
132+
"bio": "Michael is a Software Engineer volunteering at PolicyEngine",
133+
"linkedin": "https://www.linkedin.com/in/michaeldsmit/",
134+
"headshot": "michael-smit.jpeg",
135+
"title": "Volunteer Software Engineer for PolicyEngine"
128136
}
129137
}

src/posts/posts.json

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1063,5 +1063,14 @@
10631063
"authors": ["david-trimmer"],
10641064
"filename": "ny-wftc.md",
10651065
"image": "ny-wftc.webp"
1066+
},
1067+
{
1068+
"title": "How We're Improving our API in 2025",
1069+
"description": "PolicyEngine's API target architecture for 2025.",
1070+
"date": "2025-02-05",
1071+
"tags": ["technical", "global"],
1072+
"authors": ["michael-smit"],
1073+
"filename": "api-v2.md",
1074+
"image": "api-v2.webp"
10661075
}
10671076
]
