feat: add deterministic seed_random management command for synthetic …#3041
feat: add deterministic seed_random management command for synthetic …#3041abdihakim92x1 wants to merge 40 commits intomainfrom
Conversation
jeanpierrefouche-ukhsa
left a comment
There was a problem hiding this comment.
Good code! Please see my comments.
50b07fa to
a70c824
Compare
jrdh
left a comment
There was a problem hiding this comment.
There may be some other things I could pick up on in here but I'm going to pause going through it because one of the ACs is unmet and is a bigger problem - specifically "The data is generated and uploaded to the s3 bucket in the target dev env, not directly inserted into the database". Instead of inserting the metric data into the database, this should be generating JSON files and placing them in the ingestion bucket for the target environment. This allows us to to test ingestion and validation, as well as avoids duplication of logic related to how metric data is added to the database. Appreciate this means a change in approach so do shout if there was a reason you went down this path instead of following the AC.
| Stratum.objects.all().delete() | ||
|
|
||
| @classmethod | ||
| def _seed_time_series_rows( |
There was a problem hiding this comment.
Can we add a docstring please?
4d2477c to
c99d952
Compare
Thanks, agree this AC calls for generating payloads into the ingestion S3 bucket rather than writing directly to metrics tables. |
jrdh
left a comment
There was a problem hiding this comment.
As well as the few comments throughout, it'd also be good to add a system test added in the same vein as the build_cms_site system tests - i.e. use the actual command with the database and confirm that we get what we expect out of it using the API / querying the database.
Otherwise looks decent so far, though I haven't had a chance to test it in my aws env yet. I'll reply to the comment about the AC in a mo.
| truncate_first: bool, | ||
| progress_callback: Callable[[str], None] | None = None, | ||
| ) -> dict[str, int]: | ||
| """Seed supporting metric models and time series rows for the selected scale.""" |
There was a problem hiding this comment.
Can we add param docs please?
| @classmethod | ||
| def _seed_theme_hierarchy(cls) -> tuple[list[Theme], list[SubTheme], list[Topic]]: | ||
| theme_names, sub_theme_rows, topic_rows = cls._build_theme_hierarchy_records() | ||
| themes = cls._bulk_create(Theme, [Theme(name=name) for name in theme_names]) |
There was a problem hiding this comment.
When testing locally, if I run uhd bootstrap all and then run this seeding command without any truncation (e.g. python manage.py seed_random --dataset metrics) I get integrity errors. Try seed 1773741316 and you should get the same. Essentially the hierarchy created here needs to either take the existing hierarchy into account when it's generated or when bulk creating the records in the db we need to be ok with and manage the integrity errors.
Here's the stack trace:
((.venv) ) ➜ data-dashboard-api git:(task/CDD-3154-seed-random-data) python manage.py seed_random --dataset metrics
Seed used: 1773741316
Seeding metrics dataset...
Preparing metric taxonomy and geography records...
Traceback (most recent call last):
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
return self.cursor.execute(sql, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/sqlite3/base.py", line 360, in execute
return super().execute(query, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
sqlite3.IntegrityError: UNIQUE constraint failed: data_theme.name
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/josh/work/data-dashboard-api/manage.py", line 23, in <module>
main()
File "/home/josh/work/data-dashboard-api/manage.py", line 19, in main
execute_from_command_line(sys.argv)
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
utility.execute()
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/__init__.py", line 436, in execute
self.fetch_command(subcommand).run_from_argv(self.argv)
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/base.py", line 420, in run_from_argv
self.execute(*args, **cmd_options)
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/core/management/base.py", line 464, in execute
output = self.handle(*args, **options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 96, in handle
counts = self._seed_metrics_data(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 133, in _seed_metrics_data
themes, sub_themes, topics = cls._seed_theme_hierarchy()
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 259, in _seed_theme_hierarchy
themes = cls._bulk_create(Theme, [Theme(name=name) for name in theme_names])
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/metrics/interfaces/management/commands/seed_random.py", line 376, in _bulk_create
return model.objects.bulk_create(list(records))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/manager.py", line 87, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 825, in bulk_create
returned_columns = self._batched_insert(
^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1901, in _batched_insert
self._insert(
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/query.py", line 1873, in _insert
return query.get_compiler(using=using).execute_sql(returning_fields)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/models/sql/compiler.py", line 1882, in execute_sql
cursor.execute(sql, params)
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 122, in execute
return super().execute(sql, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 79, in execute
return self._execute_with_wrappers(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 92, in _execute_with_wrappers
return executor(sql, params, many, context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 100, in _execute
with self.db.wrap_database_errors:
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/utils.py", line 91, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/utils.py", line 105, in _execute
return self.cursor.execute(sql, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/josh/work/data-dashboard-api/.venv/lib/python3.12/site-packages/django/db/backends/sqlite3/base.py", line 360, in execute
return super().execute(query, params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
django.db.utils.IntegrityError: UNIQUE constraint failed: data_theme.name
| geography=geography, | ||
| stratum=stratum, | ||
| age=age, | ||
| sex=None, |
There was a problem hiding this comment.
Could we randomise this between the currently allowed values (male, female, all)?
| "small": {"geographies": 5, "metrics": 10, "days": 30}, | ||
| "medium": {"geographies": 20, "metrics": 50, "days": 180}, | ||
| "large": {"geographies": 100, "metrics": 200, "days": 365}, | ||
| } |
There was a problem hiding this comment.
Would be nice to add a comment here to give an indication of the scale of these. From testing myself, small gets you about 1500 time series records, medium gets you 180000, and large gets you 7.3 million.
I'd rather the original brief was fulfilled than some work merged and then another ticket created to do what the original work should have done. Given none of this functionality exists yet and there's no urgent need for this, that I'm aware of, the only downside I can see is there will be chunks of code in this PR we don't need any more. From your investigations to get to this point did you find anything that would make using the S3 bucket ingestion pathway problematic to implement? |
b274ea0 to
3c728b1
Compare
…311) instead of # noqa: S311
…ff and Bandit at the exact call sites.
…t hierarchy seeding, and end-to-end test coverage
6a4ee5a to
75cc893
Compare
…KHSA-Internal/data-dashboard-api into task/CDD-3154-seed-random-data
|


…dev data
Description
This PR includes the following:
Fixes #CDD-3154 Partly
Type of change
Please select the options that are relevant.
Checklist: