forked from nhs-england-tools/playwright-python-blueprint
Added logic to check and correct for duplicate records. #55
Merged: adrianoaru-nhs merged 6 commits into `main` from `feature/BCSS-20328-compartment-5-data-management` on May 8, 2025.
Changes shown are from 4 of the 6 commits:
- 22bca70 Added logic to check and correct for duplicate records. (adrianoaru-nhs)
- 19226b7 fixed value issue (adrianoaru-nhs)
- 465eb9d Merge branch 'main' of github.com:NHSDigital/bcss-playwright into fea… (adrianoaru-nhs)
- 49a4cf3 Fixed spelling issue (adrianoaru-nhs)
- 9ad5d3a Changed prove to provide (adrianoaru-nhs)
- fe39246 Removing data validation util as the same thing can be achieved with … (adrianoaru-nhs)
@@ -0,0 +1,57 @@

# Utility Guide: Data Validation

The Data Validation Utility can be used to check whether any duplicate values are returned from an SQL query.

## Table of Contents

- [Utility Guide: Data Validation](#utility-guide-data-validation)
- [Table of Contents](#table-of-contents)
- [How This Works](#how-this-works)
- [Using the Data Validation Utility](#using-the-data-validation-utility)
- [Example usage](#example-usage)

## How This Works

This utility first runs the SQL query and then uses functionality from pandas to check whether any duplicate records were returned.<br>
If duplicate records are detected, it removes them and re-runs the query to replace the dropped records.
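A minimal illustration (not part of the utility itself) of the pandas calls this check relies on, using invented example data:

```python
import pandas as pd

# Invented example data: one row is an exact duplicate of another.
df = pd.DataFrame({"kitid": [1, 2, 2, 3], "nhs_number": ["a", "b", "b", "c"]})

duplicate_rows_count = int(df.duplicated().sum())  # counts repeats of earlier rows
deduped = df.drop_duplicates()

print(duplicate_rows_count)  # 1
print(len(deduped))          # 3
```

`DataFrame.duplicated()` marks every row that repeats an earlier one, so summing it gives the number of rows that `drop_duplicates()` will remove.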
## Using the Data Validation Utility

To use this utility, import the `DataValidation` class and call the method `check_for_duplicate_records()`.<br>
Here you will need to provide the SQL query as a multi-line string, ensuring that the final line is: **fetch first :subjects_to_retrieve rows only**.<br>
This is necessary because that line is later replaced with an offset clause if duplicates are found.<br>
You will also need to provide any parameters used in the query as a dictionary.
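As a sketch of why that final line matters, this is the kind of rewrite the utility performs when it re-queries for replacement rows (the query text here is a simplified stand-in, not the real one):

```python
# Simplified stand-in query; only the final line is significant here.
query = """select kitid from tk_items_t
order by kitid desc
fetch first :subjects_to_retrieve rows only"""

# The utility replaces the last line with an OFFSET clause so the
# re-run query skips rows it has already fetched.
lines = query.strip().split("\n")
lines[-1] = "OFFSET :offset_value ROWS FETCH FIRST :subjects_to_retrieve rows only"
rewritten = "\n".join(lines)

print(rewritten.splitlines()[-1])
```

If the query does not end with the expected `fetch first` line, the wrong line would be overwritten, which is why the convention is mandatory.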
## Example usage

```python
import pandas as pd

from utils.data_validation import DataValidation


def get_kit_id_from_db(
    tk_type_id: int, hub_id: int, no_of_kits_to_retrieve: int
) -> pd.DataFrame:
    query = """select tk.kitid, tk.screening_subject_id, sst.subject_nhs_number
    from tk_items_t tk
    inner join ep_subject_episode_t se on se.screening_subject_id = tk.screening_subject_id
    inner join screening_subject_t sst on (sst.screening_subject_id = tk.screening_subject_id)
    inner join sd_contact_t sdc on (sdc.nhs_number = sst.subject_nhs_number)
    where tk.tk_type_id = :tk_type_id
    and tk.logged_in_flag = 'N'
    and sdc.hub_id = :hub_id
    and device_id is null
    and tk.invalidated_date is null
    and se.latest_event_status_id in (:s10_event_status, :s19_event_status)
    order by tk.kitid DESC
    fetch first :subjects_to_retrieve rows only"""

    # SqlQueryValues is assumed to be imported from the project's
    # constants module; it holds the event-status values used below.
    params = {
        "s10_event_status": SqlQueryValues.S10_EVENT_STATUS,
        "s19_event_status": SqlQueryValues.S19_EVENT_STATUS,
        "tk_type_id": tk_type_id,
        "hub_id": hub_id,
        "subjects_to_retrieve": no_of_kits_to_retrieve,
    }

    kit_id_df = DataValidation().check_for_duplicate_records(query, params)

    return kit_id_df
```
@@ -0,0 +1,100 @@

```python
import logging

import pandas as pd

from oracle.oracle import OracleDB


class DataValidation:
    """
    This class is used to validate that there are no duplicate records when obtaining test data.
    """

    def __init__(self):
        self.max_attempts = 5

    def check_for_duplicate_records(self, query: str, params: dict) -> pd.DataFrame:
        """
        This method first obtains the test data and then checks whether there are any duplicate records.

        Args:
            query (str): The SQL query you want to run
            params (dict): A dictionary of any parameters in the SQL query

        Returns:
            dataframe (pd.DataFrame): A dataframe containing 0 duplicate records
        """
        wanted_subject_count = int(params["subjects_to_retrieve"])

        dataframe = OracleDB().execute_query(query, params)

        attempts = 0
        while attempts < self.max_attempts:
            logging.info(f"Checking for duplicates. On attempt: {attempts + 1}")
            duplicate_rows_count = int(dataframe.duplicated().sum())

            if duplicate_rows_count == 0:
                logging.info("No duplicate records found")
                return dataframe

            logging.warning(
                f"{duplicate_rows_count} duplicate records found. Dropping duplicates and retrying query."
            )
            dataframe = dataframe.drop_duplicates()
            attempts += 1
            dataframe = self.run_query_for_dropped_records(
                dataframe,
                query,
                params,
                duplicate_rows_count,
                wanted_subject_count,
                attempts,
            )

        logging.error(
            f"Maximum attempt limit of {self.max_attempts} reached. Returning dataframe with duplicates dropped and not replaced."
        )
        dataframe = dataframe.drop_duplicates()
        actual_subject_count = len(dataframe)
        if wanted_subject_count != actual_subject_count:
            logging.error(
                f"Actual subject count differs from wanted count. {wanted_subject_count} subjects wanted but only {actual_subject_count} subjects were retrieved"
            )
        return dataframe

    def run_query_for_dropped_records(
        self,
        dataframe: pd.DataFrame,
        query: str,
        params: dict,
        duplicate_count: int,
        wanted_subject_count: int,
        attempts: int,
    ) -> pd.DataFrame:
        """
        This makes up for any dropped duplicate records. It runs the same query again but only returns the number of dropped records.

        Args:
            dataframe (pd.DataFrame): The dataframe with duplicates dropped
            query (str): The SQL query you want to run
            params (dict): A dictionary of any parameters in the SQL query
            duplicate_count (int): The number of duplicate records in the original dataframe
            wanted_subject_count (int): The number of subjects to retrieve in the original query
            attempts (int): The number of attempts so far

        Returns:
            dataframe_without_duplicates (pd.DataFrame): A dataframe matching the original record count
        """
        # Skip past the rows already fetched, then fetch only as many
        # rows as were dropped as duplicates.
        params["offset_value"] = wanted_subject_count + attempts
        params["subjects_to_retrieve"] = duplicate_count

        # Replace the final "fetch first ..." line with an OFFSET clause
        # so the re-run query returns fresh rows.
        query_lines = query.strip().split("\n")
        query_lines[-1] = (
            "OFFSET :offset_value ROWS FETCH FIRST :subjects_to_retrieve rows only"
        )
        query = "\n".join(query_lines)

        dataframe_with_new_subjects = OracleDB().execute_query(query, params)

        combined_dataframe = pd.concat(
            [dataframe, dataframe_with_new_subjects], ignore_index=True
        )
        return combined_dataframe
```
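The drop-and-refill pattern above can be exercised without a database by stubbing out `execute_query`; the result batches here are invented illustration data, not the real utility's output:

```python
import pandas as pd

# Invented result batches standing in for OracleDB().execute_query:
# the first query returns one duplicate row, the re-query supplies a
# fresh replacement row.
batches = [
    pd.DataFrame({"kitid": [1, 2, 2]}),
    pd.DataFrame({"kitid": [4]}),
]

def fake_execute_query(query: str, params: dict) -> pd.DataFrame:
    return batches.pop(0)

df = fake_execute_query("...", {"subjects_to_retrieve": 3})
dropped = int(df.duplicated().sum())   # one duplicate found
df = df.drop_duplicates()
refill = fake_execute_query("...", {"subjects_to_retrieve": dropped})
df = pd.concat([df, refill], ignore_index=True)

print(int(df.duplicated().sum()))  # 0
print(len(df))                     # back to 3 rows
```

This mirrors a single iteration of the `while` loop: detect, drop, re-query for exactly as many rows as were dropped, then concatenate.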