This pull request removes the `nogevent` compatibility layer from the
library and tests. It also changes the behavior of some feature flags
related to gevent and adjusts a whole lot of tests to work with these
changes.
This change was manually tested on delancie-dummy staging and shown to
fix the issue we were investigating there. See notebook 4782453 in the
ddstaging datadoghq account for a detailed look at the metrics collected
during that test.
## What Changed?
* `sitecustomize.py`: refactored module cloning logic; changed the
default of `DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE` to `auto`, meaning it
runs when gevent is installed; unloaded a few additional modules like
`time` and `attrs` that were causing tests and manual checks to fail;
deprecated the `DD_GEVENT_PATCH_ALL` flag
* Removed various hacks and layers that had intended to fix gevent
compatibility, including in `forksafe.py`, `periodic.py`,
`ddtrace_gevent_check.py`, and `nogevent.py`
* `profiling/`: adjusted all uses of the removed module `nogevent` to
work with `threading`
* Adjusted tests to work with these removals
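As a rough illustration of the `auto` default described in the list above, the following is a minimal sketch of how a tri-state flag could be resolved. It is not the code in `sitecustomize.py`: the helper name `should_unload_modules`, the accepted spellings of true and false, and the fallback of unrecognized values to `auto` are all assumptions made for the example.

```python
import importlib.util
import os


def should_unload_modules(env=None):
    """Hypothetical helper: resolve a true/false/auto flag to a boolean."""
    env = os.environ if env is None else env
    raw = env.get("DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE", "auto").strip().lower()
    # Assumed spellings for explicit values; the real parser may differ.
    if raw in ("1", "true"):
        return True
    if raw in ("0", "false"):
        return False
    # "auto" (and, in this sketch, anything unrecognized): enable the
    # behavior only when gevent is importable in this environment.
    return importlib.util.find_spec("gevent") is not None
```

With this shape, an explicit `true` or `false` always wins, and only the default depends on whether gevent is installed.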
We tried to separate some of these changes into other pull requests,
which you can see linked in the discussion below. Because of how
difficult it is to replicate the issue we're chasing outside of the
staging environment, we decided to minimize the number of variables
under test and combine the various changes into a single pull request.
This makes it a bit harder to review, which we've tried to mitigate with
the checklist above.
## Risk
The main risk in this change is the change to the default behavior of
module cloning. We've mitigated this risk with the automated test suite
as well as manual testing described in the notebook above.
**Why doesn't this change put all new behavior behind a feature flag and
leave the default behavior untouched?**
The main reason for this decision is pragmatic: it's really hard to test
for the issue this solves, requiring a turnaround time of about an hour
to get feedback from changes. The secondary reason is that the
`nogevent` layer is highly coupled to the rest of the library's code,
and putting it behind a feature flag is a significant and nontrivial
effort. The third reason is that full support of all of the various
configurations and combinations with other tools that gevent can be used
in is a goal that we could probably spend infinite time on if we chose
to. Given this, we need to intentionally set a goal that solves the
current and likely near-future issues as completely as possible, make it
the default behavior, and call this effort "done".
@brettlangdon, @P403n1x87, @Yun-Kim and I are in agreement that the evidence in
notebook 4782453 in the ddstaging datadoghq account is enough to
justify this change to the default behavior.
## Performance Testing
Performance testing with a sample flask application (notebook 4442578)
shows no immediately noticeable impact of tracing.
Dynamic instrumentation seems to cause slow-downs, and the reason has
been tracked down to joining service threads on shutdown. Avoiding the
joins cures the problem, but further testing is required to ensure that
DI still behaves as intended.
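To make the cost of joining on shutdown concrete, here is a small self-contained sketch. It is not the dynamic instrumentation code: `PeriodicService` and `time_shutdown` are hypothetical names, and the worker deliberately uses an uninterruptible `time.sleep` so that a join has to wait for the current interval to finish.

```python
import threading
import time


class PeriodicService(threading.Thread):
    """Hypothetical worker that wakes up once per interval."""

    def __init__(self, interval):
        super().__init__(daemon=True)
        self.interval = interval
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            # Not interruptible: a stop request is only seen after the sleep.
            time.sleep(self.interval)

    def stop(self, join=True):
        self._stop_event.set()
        if join:
            self.join()


def time_shutdown(join):
    """Return how long stop() blocks the caller, in seconds."""
    service = PeriodicService(interval=0.2)
    service.start()
    time.sleep(0.05)  # let the worker enter its sleep
    started = time.monotonic()
    service.stop(join=join)
    return time.monotonic() - started
```

In this sketch, `time_shutdown(join=True)` blocks for most of the remaining interval, while `time_shutdown(join=False)` returns almost immediately. Skipping the join trades that latency for the risk that the worker has not finished its last cycle, which is why further testing is needed.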
Profiling also shows a slow-down in the application response when
enabled. This seems to be due to retrieving the response from the agent
after the payload has been uploaded. A potential solution to this might
be offered by libdatadog.
The following are the details of the scenario used to measure the
performance under different configurations.
The application is the simple Flask app of the issue reproducer:
```
# app.py
import os
import time

from ddtrace.internal.remoteconfig import RemoteConfig
from flask import Flask

app = Flask(__name__)


def start():
    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        x = 2
        while x > 0:
            time.sleep(0.2)
            x -= 1
    else:
        os.waitpid(pid1, 0)


@app.route("/")
def index():
    start()
    return "OK" if RemoteConfig._worker is not None else "NOK"
```
We can control which products to start with the following run.sh script:
```
#!/bin/bash
source .venv/bin/activate
export DD_DYNAMIC_INSTRUMENTATION_ENABLED=true
export DD_PROFILING_ENABLED=false
export DD_TRACE_ENABLED=false
export DD_ENV=gab-testing
export DD_SERVICE=flask-gevent
ddtrace-run gunicorn -w 3 -k gevent app:app
deactivate
```
To run the app, we create a virtual environment with
```
python3.9 -m venv .venv
source .venv/bin/activate
pip install flask gevent gunicorn
pip install -e path/to/dd-trace-py
deactivate
```
and then invoke the script, adjusting the exported variables as required:
```
chmod +x run.sh
./run.sh
```
In another terminal, we can check the average response time by sending
requests to the application while it is running, using the following
simple k6 script:
```
import http from 'k6/http';
export default function () {
  http.get('http://localhost:8000');
}
```
We invoke k6 with
```
k6 run -d 30s script.js
```
and look for this line in the output:
```
http_req_duration..............: avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)
```
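When comparing several configurations, it can be handy to pull the numbers out of that line programmatically. The following is a minimal sketch under stated assumptions: the helper name `parse_http_req_duration` is hypothetical, it only handles simple `name=value` pairs with a `µs`, `us`, `ms` or `s` unit, and it skips percentile fields such as `p(90)`.

```python
import re

# Conversion factors to milliseconds for the units this sketch understands.
_UNITS_TO_MS = {"µs": 0.001, "us": 0.001, "ms": 1.0, "s": 1000.0}


def parse_http_req_duration(line):
    """Hypothetical helper: return {stat: milliseconds} for one summary line."""
    if not line.lstrip().startswith("http_req_duration"):
        raise ValueError("not an http_req_duration summary line")
    stats = {}
    # Matches pairs such as avg=335.68ms; p(90)=... is not matched by \w+.
    for name, value, unit in re.findall(r"(\w+)=([\d.]+)(µs|us|ms|s)\b", line):
        stats[name] = float(value) * _UNITS_TO_MS[unit]
    return stats


sample = (
    "http_req_duration..............: "
    "avg=335.68ms min=119.56ms med=418.76ms max=451.49ms p(90)"
)
stats = parse_http_req_duration(sample)
print(stats["avg"])
```

Running the script once per configuration and recording `stats["avg"]` gives a simple side-by-side comparison of average response times.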
Co-authored-by: Yun Kim <[email protected]>
Co-authored-by: Gabriele N. Tornetta <[email protected]>
Co-authored-by: Juanjo Alvarez Martinez <[email protected]>
Co-authored-by: Brett Langdon <[email protected]>