INTPYTHON-847 Add performance tests #366
base: main
Conversation
In general, awesome! Re:
Can this be integrated with existing tests such that some established testing conventions are followed? For example, …
IOW If the …

Running the performance tests alongside the Django tests would add too much runtime to be practical. Putting the … Is there a way to independently run tests on the …

I'm inclined to agree that the performance tests should be in their own directory. Probably integrating with Django's …
class TestSmallFlatDocCreation(SmallFlatDocTest, TestCase):
    def do_task(self):
        for doc in self.documents:
            model = SmallFlatModel(**doc)
            model.save()

    def after(self):
        SmallFlatModel.objects.all().delete()
I'm a little uncertain how this whole thing fits together. On the one hand, you inherit django.test.TestCase, which has its own setUp/tearDown strategy; on the other hand, you seem to implement your own scheme with an after() method. Some classes use after(), others tearDown()... is it intentional? (edit: perhaps answered by the benchmark spec.)
The ODM benchmark spec (draft PR here: mongodb/specifications#1828) is helpful here, but the short answer is yes, we need the ability to do setup or teardown both before and after every iteration of a test as well as the entire set of iterations.
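To make that concrete, here is a rough sketch of the hook structure being described. The method names before()/after()/do_task() follow the snippets in this PR; the timing loop, NUM_ITERATIONS, and the class name are illustrative assumptions, not the actual implementation.

import time


class PerformanceTestSketch:
    NUM_ITERATIONS = 10  # illustrative; real iteration counts come from the spec

    def setUp(self):
        # Runs once, before the whole set of iterations.
        self.results = []

    def tearDown(self):
        # Runs once, after the whole set of iterations.
        pass

    def before(self):
        # Runs before every single iteration.
        pass

    def after(self):
        # Runs after every single iteration.
        pass

    def do_task(self):
        raise NotImplementedError

    def run_iterations(self):
        for _ in range(self.NUM_ITERATIONS):
            self.before()
            start = time.monotonic()
            self.do_task()
            self.results.append(time.monotonic() - start)
            self.after()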
tests/performance/perftest/tests.py
Outdated
    def after(self):
        SmallFlatModel.objects.all().delete()
I'm a little uncertain how this whole thing fits together. On the one hand, you inherit django.test.TestCase, which has its own setUp/tearDown strategy; on the other hand, you seem to implement your own scheme with an after() method. Reading the spec, it looks like some after() methods should be tearDown(). In this case, if you delete the objects after the first iteration, future iterations are actually creating new models rather than updating existing ones. Check your other tests too!
Good catch! Definitely a bug here. I'll have to modify the benchmark to ensure that each iteration of updated_value is unique so the database actually performs an update operation.
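A hedged sketch of what that fix could look like: vary the value per iteration so every do_task() issues a real update, and only delete the rows once all iterations are done. The field name field1 is an assumption for illustration.

class TestSmallFlatDocUpdate(SmallFlatDocTest, TestCase):
    def setUp(self):
        super().setUp()
        self.iteration = 0
        self.models = SmallFlatModel.objects.bulk_create(
            SmallFlatModel(**doc) for doc in self.documents
        )

    def do_task(self):
        # Unique value per iteration so the database performs an actual update.
        self.iteration += 1
        for model in self.models:
            model.field1 = f"updated_value_{self.iteration}"
            model.save()

    def tearDown(self):
        super().tearDown()
        SmallFlatModel.objects.all().delete()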
# Install django-mongodb-backend
/opt/python/3.10/bin/python3 -m venv venv
. venv/bin/activate
NIT:
Whilst I'd rather not have a new dependency on drivers-evergreen-tools, to future-proof the binary usage, you could add in the find_python function
I copied this over from the existing run-tests.sh script here. Let's update them both in a separate ticket.
We're going to bring in drivers-evergreen-tools anyway for QE.
name = self.__class__.__name__[4:]
median = self.percentile(50)
megabytes_per_sec = self.data_size / median / 1000000
print(  # noqa: T201
Can we use LOGGER here?
logging.info doesn't play nicely with tests here. For perf tests I don't think it's worth the effort to fix.
The output is awkward (dots from each test passing, mixed with these prints):
System check identified no issues (0 silenced).
Completed TestLargeFlatDocCreation 46.140 MB/s, MEDIAN=52.461s, total time=94.050s, iterations=1
.Completed TestLargeFlatDocFilterPkByIn 585.696 MB/s, MEDIAN=4.133s, total time=53.420s, iterations=2
.Completed TestLargeFlatDocUpdate 0.001 MB/s, MEDIAN=24.834s, total time=81.044s, iterations=1
The elegant solution might be to use a custom test runner / test result (which has access to the verbosity level) so that this debugging information can be output only at verbosity > 1.
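A rough sketch of that idea, using Django's DiscoverRunner hook for a custom result class. The names BenchmarkRunner, BenchmarkResult, and the benchmark_summary() method are illustrative assumptions, not part of this PR.

from unittest.runner import TextTestResult

from django.test.runner import DiscoverRunner


class BenchmarkResult(TextTestResult):
    def addSuccess(self, test):
        super().addSuccess(test)
        # showAll is True at verbosity > 1; benchmark_summary() is a
        # hypothetical method the benchmark classes would need to expose.
        if self.showAll and hasattr(test, "benchmark_summary"):
            self.stream.writeln(test.benchmark_summary())


class BenchmarkRunner(DiscoverRunner):
    def get_resultclass(self):
        return BenchmarkResult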
Not to be flippant, but I don't think this really matters or should be a blocker for merging. This debugging output will only be relevant for local runs.
I agree, just mentioning it for posterity. Implementation details could be borrowed from Django's --debug-sql option.
        self.documents = [self.document.copy() for _ in range(NUM_DOCS)]


class TestSmallFlatDocCreation(SmallFlatDocTest, TestCase):
Does this need to inherit django.test.TestCase? It's unclear to me because I believe PerformanceTest overrides many methods without calling super() which blocks at least some of TestCase's normal setup and teardown operations. I'll have to study this to check how it actually works.
PerformanceTest does not inherit from any class, it's effectively a mixin.
What I meant is that all the benchmarks inherit django.test.TestCase, which has certain functionality such as cleaning the database between each test and calling a user-defined setUpTestData() (if it exists); however, it's unclear if any of this functionality is used. Each test has a manual tearDown(), for example, and the benchmark seems to work just as well with:
diff --git a/performance_tests/perftest/tests.py b/performance_tests/perftest/tests.py
index 6f45511d..a11ed7e6 100644
--- a/performance_tests/perftest/tests.py
+++ b/performance_tests/perftest/tests.py
@@ -14,6 +14,10 @@ from pathlib import Path
 from bson import ObjectId, encode, json_util
 from django.test import TestCase
+import unittest
+
+TestCase = unittest.TestCase
+
 from .models import (
     ForeignKeyModel,
     IntegerEmbeddedModel,
Good catch, I'll use unittest.TestCase instead of Django's.
result_data: list = []


def tearDownModule():
    output = json.dumps(result_data, indent=4)
    if OUTPUT_FILE:
        with open(OUTPUT_FILE, "w") as opf:  # noqa: PTH123
            opf.write(output)
    else:
        print(output)  # noqa: T201
Uff... I wonder if this blocks splitting up the tests into multiple files? At least it may need to become "append aware."
Yes, splitting into multiple files is complicated by our infrastructure expecting a single JSON output file for the entire suite.
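If splitting ever becomes necessary, an "append aware" tearDownModule could look roughly like this: a sketch that merges this module's results into any existing OUTPUT_FILE instead of overwriting it.

import json
import os

OUTPUT_FILE = os.environ.get("OUTPUT_FILE")
result_data: list = []


def tearDownModule():
    existing = []
    # Merge with results already written by another test module, if any.
    if OUTPUT_FILE and os.path.exists(OUTPUT_FILE):  # noqa: PTH110
        with open(OUTPUT_FILE) as opf:  # noqa: PTH123
            existing = json.load(opf)
    output = json.dumps(existing + result_data, indent=4)
    if OUTPUT_FILE:
        with open(OUTPUT_FILE, "w") as opf:  # noqa: PTH123
            opf.write(output)
    else:
        print(output)  # noqa: T201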
After I suggested a default for OUTPUT_FILE, I now see that would break this if/else. Maybe we just remove OUTPUT_FILE from the documentation on how to run the tests? And actually, perhaps it's appropriate to remove this print statement also as I imagine the test by test output would be more useful than the JSON version meant for machine consumption?
The JSON version is a nice, simple summary that's useful for local use as well. Setting a default in the run-perf-tests.sh script that CI/CD uses makes sense, but we shouldn't set a default in the actual test suite itself.
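A minimal sketch of that suggestion, assuming a run-perf-tests.sh driver script as mentioned above: the default is applied only in the CI script, leaving the suite's OUTPUT_FILE handling untouched.

# In the CI driver script (run-perf-tests.sh), not in the test suite:
export OUTPUT_FILE="${OUTPUT_FILE:-results.json}"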
I'm not going to argue, but my feeling is that:
TestSmallFlatDocUpdate 0.007 MB/s, MEDIAN=4.282s, total time=8.648s, iterations=2
is much easier to read than:
{
"info": {
"test_name": "SmallFlatDocUpdate"
},
"metrics": [
{
"name": "megabytes_per_sec",
"type": "MEDIAN",
"value": 0.007472334166114841,
"metadata": {
"improvement_direction": "up",
"measurement_unit": "megabytes_per_second"
}
}
]
}
and to get to the former, I have to scroll up past the huge JSON dump. (Of course, nothing is stopping someone from adding or removing that print statement while testing their changes.)
Incidentally, the JSON results don't include the elapsed time. Is it correct that we won't track that metric? The spec says, "In addition to latency data, ..."
Where can details about the metadata format (improvement_direction, etc.) be found?
We don't report elapsed time in the JSON metrics, correct. The JSON metrics are tracked by Evergreen's charting, and we want a single metric easily comparable across ODMs there. We record elapsed time individually as part of each suite run as a secondary metric if needed for investigation.
The metadata format is for Evergreen chart tracking.
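For illustration, a hedged sketch of how an entry matching the JSON example earlier in this thread could be appended to result_data (variable names taken from the snippets above; the PR's actual helper may differ):

# name and megabytes_per_sec are computed per test, as in the snippet above.
result_data.append(
    {
        "info": {"test_name": name},
        "metrics": [
            {
                "name": "megabytes_per_sec",
                "type": "MEDIAN",
                "value": megabytes_per_sec,
                "metadata": {
                    "improvement_direction": "up",
                    "measurement_unit": "megabytes_per_second",
                },
            }
        ],
    }
)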
display_name: Performance Benchmarks
run_on:
  - rhel90-dbx-perf-large
batchtime: 1440
What does it mean?
batchtime is specified in minutes; a value of 1440 ensures that the performance tests run at most once per day.
Got it, so the other thread where you said "these tests are going to run once for each commit merged into master" isn't correct? There could be multiple commits per day but just one run, right?
Sorry, my earlier comment was misleading. That's correct: we expect one run of the perf tests per day on which a commit is merged into master.
tasks:
  - name: perf-tests
In the PR description you wrote, "The full suite run locally is expected to take approximately 10 minutes." On the latest Evergreen build for this PR, it took 35 minutes... is that expected? That's a bit slower than the rest of our builds (~20 minutes). Is this going to run for every push to every PR? It's probably unnecessary. Perhaps it could run a very fast version with very low iterations just to ensure these tests don't break.
That was for an earlier iteration that didn't fully implement the spec.
No, these tests are going to run once for each commit merged into master, not once per PR push.
So they won't run at all on PRs? What do you think about my point about running with one/low iterations on PRs to ensure a PR doesn't break the suite?
Since the suite runs public APIs only, I don't think having an extra suite just to test that the suite itself doesn't break is worth it. Anything that breaks this suite should be caught by other tests failing.
But the performance tests are running on this PR... Is there some change we need to make before merging this so that they don't run on PRs? What happens when we need to make a change to the performance tests? Do we temporarily modify some trigger to have them run on that PR, then revert the trigger before merging?
We'll have to update the Evergreen project settings to exclude the benchmark tests on PRs. Changes to the perf tests would require scheduling Evergreen patches to confirm changes work as expected.
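For reference, a hedged example of scheduling such an Evergreen patch build from the CLI; the project, variant, and task names below are assumptions, not values taken from this PR.

# Schedule a manual patch build for the benchmark task (names assumed):
$ evergreen patch -p django-mongodb-backend -v performance-benchmarks -t perf-tests -d "Test perf suite changes" -y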
class TestSmallFlatDocFilterPkByIn(SmallFlatDocTest, TestCase):
    def setUp(self):
        super().setUp()
        self.ids = []
        models = []
        for doc in self.documents:
            models.append(SmallFlatModel(**doc))
        SmallFlatModel.objects.bulk_create(models)
        self.ids = [model.id for model in models]

    def do_task(self):
        list(SmallFlatModel.objects.filter(id__in=self.ids))

    def tearDown(self):
        super().tearDown()
        SmallFlatModel.objects.all().delete()


class TestLargeFlatDocFilterPkByIn(LargeFlatDocTest, TestCase):
    def setUp(self):
        super().setUp()
        models = []
        for doc in self.documents:
            models.append(LargeFlatModel(**doc))
        LargeFlatModel.objects.bulk_create(models)
        self.id = LargeFlatModel.objects.first().id

    def do_task(self):
        list(LargeFlatModel.objects.filter(id__in=[self.id]))

    def tearDown(self):
        super().tearDown()
        LargeFlatModel.objects.all().delete()
These classes have the same name except for LargeFlatDoc vs. SmallFlatDoc, but the queries are different: (id__in=self.ids) vs. (id__in=[self.id]). Is it intentional?
Not intentional, the pattern used by SmallFlatDoc is the intended one.
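For clarity, a sketch of the intended fix, aligning the large-document test with the SmallFlatDoc pattern shown above:

class TestLargeFlatDocFilterPkByIn(LargeFlatDocTest, TestCase):
    def setUp(self):
        super().setUp()
        models = [LargeFlatModel(**doc) for doc in self.documents]
        LargeFlatModel.objects.bulk_create(models)
        self.ids = [model.id for model in models]

    def do_task(self):
        # Filter on all created ids, matching the SmallFlatDoc pattern.
        list(LargeFlatModel.objects.filter(id__in=self.ids))

    def tearDown(self):
        super().tearDown()
        LargeFlatModel.objects.all().delete()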
Django MongoDB Backend uses a benchmarking suite to catch performance regressions.
This suite is located in the top-level ``performance_tests`` directory of the Django MongoDB Backend repository.
I'd also like to see a discussion (linking to the benchmark spec as appropriate) about what this is benchmarking (both Python code and MongoDB queries), how to interpret the results (Mb/sec??), where do we view the ongoing results, etc.
I'm not sure what you mean here. We could add a link to the benchmark spec, which details the purpose and scope of the suite? There's already a link in the actual test file. Each run will have an Evergreen chart attached to it; there isn't a single place where all the results are aggregated dynamically.
Basically, how can I use this tool to catch performance regressions? Am I supposed to run it locally and compare before and after? Is there a histogram of results? Will regressions be automatically detected? How will we know if the results are statistically significant?
Django has https://github.com/django/django-asv/tree/main and results displayed at https://django.github.io/django-asv/ but it's not obvious how to interpret and use the results, so I have similar concerns here.
Is this supposed to catch performance regressions in Django MongoDB Backend, MongoDB, or both? How can we differentiate?
The spec answers some questions, but it's a long document with a lot of unnecessary details. Maybe a high-level "cheat sheet" would be useful for other ODMs as well.
I suppose not all questions need to be answered before merging this (perhaps some answers are unknown), but this is what I've been thinking about.
Short answer: https://jira.mongodb.org/browse/INTPYTHON-530 will set up automated alerting to be identical to our current PyMongo automated regression alerts. Here's a link to what the charting looks like for that, it'll look similar for Django: https://spruce.mongodb.com/task/mongo_python_driver_performance_benchmarks_perf_8.0_standalone_a89c5e3a89fa7e621f03fc9173067256a5fdc300_26_01_26_19_36_51/trend-charts?execution=0.
The expectation is that the benchmark suite will run up to once per day on commit merge and automatically alert us in Slack if any regressions are detected. No local runs are expected generally.
.evergreen/run_perf_test.py
Outdated
LOGGER = logging.getLogger("test")
logging.basicConfig(level=logging.INFO, format="%(levelname)-8s %(message)s")
OUTPUT_FILE = os.environ.get("OUTPUT_FILE")
Can we make "results.json" the default so it's not required to be set? Maybe the environ variable is useful for local debugging, but I don't see any other values set in this PR, right?
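For reference, the suggested default would be a one-line change, shown here as a sketch (per the earlier thread, the default may instead live in the CI script rather than the suite):

OUTPUT_FILE = os.environ.get("OUTPUT_FILE", "results.json")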
.. code-block:: bash

    $ git clone --depth 1 https://github.com/mongodb/specifications.git
    $ pushd specifications/source/benchmarking/odm-data
    $ tar xf flat_models.tgz
    $ tar xf nested_models.tgz
    $ popd
    $ export DJANGO_MONGODB_PERFORMANCE_TEST_DATA_PATH="specifications/source/benchmarking/odm-data"
    $ export OUTPUT_FILE="results.json"
These instructions should put the data in performance_tests/odm-data so that the environ variable isn't required.
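A hedged sketch of what the revised instructions could look like; the paths are assumptions based on this suggestion and the ``performance_tests`` layout mentioned in the docs snippet above.

.. code-block:: bash

    $ git clone --depth 1 https://github.com/mongodb/specifications.git /tmp/specifications
    $ mkdir -p performance_tests/odm-data
    $ cp /tmp/specifications/source/benchmarking/odm-data/*.tgz performance_tests/odm-data/
    $ pushd performance_tests/odm-data
    $ tar xf flat_models.tgz
    $ tar xf nested_models.tgz
    $ popd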
Purpose
This is the first draft of the Django implementation of the MongoDB ODM Benchmarking suite. It contains a small suite of benchmark tests designed to measure django-mongodb-backend performance across data sizes and structures for what we expect to be common user operations.
This is NOT intended to be a comprehensive test suite for every operation, only the most common and widely applicable. This is also not intended to be Django specific: each of these tests must be implementable across all of our ODMs, up to the features supported by each library.
Structure
The benchmarks are contained within a separate Django application within tests/performance/perftest. This application exists only to execute the benchmarking suite.
Running Locally
After starting a local MongoDB server, the tests can be run by running python manage.py test from the tests/performance directory. The full suite run locally is expected to take approximately 10 minutes. For faster benchmarks, pass FASTBENCH=1 as an environment variable to tests/performance/perftest/tests.py.
Review Items