fix: (CDK) (Manifest) - Add Manifest Normalization module (reduce commonalities + handle schema $refs)
#447
Merged
Commits (37)
a488ab3  deduplication version 1
7d910ee  deduplication version 2
691d16a  updated duplicates collection
081e7a8  deduplicate most frequent tags, use existing refs if definitions.shar…
180af86  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
138b607  formatted
f10e601  updated to account type for the given duplicated key
66fe38e  add the reduce_commons: true, for Connector Builder case
8798042  enabled the reduce_commons: True for Connector Builder case
1d425ee  refactorred and cleaned up the code, moved to use the class instead
06b183a  formatted
1fa891c  formatted
00e31a7  cleaned up
a5aba82  added the dedicated tests
e017e92  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
0e8394f  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
9f7d498  formatted
6ec240a  updated normalizer
acdecdb  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
5f5c6b1  attempt to fix the Connector Builder tests
e97afa5  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
be3bab1  revert test
748892d  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
b10d7a1  removed post_resolve_manifest flag
0587481  nit
d929167  add _-should_normalize flag handling
3859c5b  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
9de27ef  formatted
c403a0e  rename sharable > linkable, shared > linked
297ae37  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
38f7da6  updated the order of operations; normalization should go after pre-pr…
7d71f4b  fixed
304235c  add schema extraction + unit test
348aaae  Merge branch 'main' into baz/cdk/extract-common-manifest-parts
2c8d164  updated test comments
2010419  Merge remote-tracking branch 'origin/main' into baz/cdk/extract-commo…
8d7be4e  updated linked
262 additions & 0 deletions
airbyte_cdk/sources/declarative/parsers/manifest_deduplicator.py
```python
#
# Copyright (c) 2023 Airbyte, Inc., all rights reserved.
#

import copy
import hashlib
import json
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List, Optional, Tuple

from airbyte_cdk.sources.declarative.parsers.custom_exceptions import ManifestDeduplicationException

# Type definitions for better readability
ManifestType = Dict[str, Any]
DefinitionsType = Dict[str, Any]
DuplicatesType = DefaultDict[str, List[Tuple[List[str], Dict[str, Any], Dict[str, Any]]]]

# Configuration constants
N_OCCURRENCES = 2

DEF_TAG = "definitions"
SHARED_TAG = "shared"

# Tags eligible for deduplication
TAGS = [
    "authenticator",
    "url_base",
]


def deduplicate_definitions(resolved_manifest: ManifestType) -> ManifestType:
    """
    Find commonalities in the input JSON structure and refactor it to avoid redundancy.

    Args:
        resolved_manifest: A dictionary representing a JSON structure to be analyzed.

    Returns:
        A refactored JSON structure with common properties extracted to `definitions.shared`
        and the duplicated properties replaced with references.
    """
    try:
        _manifest = copy.deepcopy(resolved_manifest)
        definitions = _manifest.get(DEF_TAG, {})

        duplicates = _collect_duplicates(definitions)
        _handle_duplicates(definitions, duplicates)

        return _manifest
    except ManifestDeduplicationException:
        # if any error occurs, return the original manifest unchanged
        return resolved_manifest


def _replace_duplicates_with_refs(definitions: ManifestType, duplicates: DuplicatesType) -> None:
    """
    Process duplicate objects and replace them with references.

    Args:
        definitions: The definitions dictionary to modify
        duplicates: Dictionary of duplicate objects keyed by content hash
    """
    for _, occurrences in duplicates.items():
        # Skip values that occur fewer than N_OCCURRENCES times
        if len(occurrences) < N_OCCURRENCES:
            continue

        # Take the value from the first occurrence; all occurrences are identical
        path, _, value = occurrences[0]
        # The component's name is the last part of its path
        key = path[-1]
        # Create a collision-free reference key
        ref_key = _create_reference_key(definitions, key)
        # Add to shared definitions
        _add_to_shared_definitions(definitions, ref_key, value)

        # Replace all occurrences with references
        for path, parent_obj, _ in occurrences:
            if path:  # Make sure the path is valid
                key = path[-1]
                parent_obj[key] = _create_ref_object(ref_key)


def _handle_duplicates(definitions: DefinitionsType, duplicates: DuplicatesType) -> None:
    """
    Process the duplicates and replace them with references.

    Args:
        definitions: The definitions dictionary to modify
        duplicates: Dictionary of duplicate objects
    """
    # process duplicates only if there are any
    if len(duplicates) > 0:
        if SHARED_TAG not in definitions:
            definitions[SHARED_TAG] = {}

        try:
            _replace_duplicates_with_refs(definitions, duplicates)
        except Exception as e:
            raise ManifestDeduplicationException(str(e)) from e


def _is_allowed_tag(key: str) -> bool:
    """
    Check if the key is an allowed tag for deduplication.

    Args:
        key: The key to check

    Returns:
        True if the key is allowed, False otherwise
    """
    return key in TAGS


def _add_duplicate(
    duplicates: DuplicatesType,
    current_path: List[str],
    obj: Dict[str, Any],
    value: Any,
    key: Optional[str] = None,
) -> None:
    """
    Add a record of an observed object by computing a unique hash for the provided value.

    This function computes a hash for the given value (or a dictionary composed of the key
    and value, if a key is provided) and appends a tuple containing the current path, the
    original object, and the value to the duplicates dictionary under the corresponding hash.

    Args:
        duplicates: The dictionary to store duplicate records.
        current_path: The list of keys or indices representing the current location in the object hierarchy.
        obj: The original dictionary object in which the duplicate is observed.
        value: The value to be hashed and used for identifying duplicates.
        key: An optional key that, if provided, wraps the value in a dictionary before hashing.
    """
    # create a hash for the observed value
    value_to_hash = value if key is None else {key: value}
    obj_hash = _hash_object(value_to_hash)
    if obj_hash:
        duplicates[obj_hash].append((current_path, obj, value))


def _add_to_shared_definitions(
    definitions: DefinitionsType,
    key: str,
    value: Any,
) -> DefinitionsType:
    """
    Add a value to the shared definitions under the specified key.

    Args:
        definitions: The definitions dictionary to modify
        key: The key to use
        value: The value to add
    """
    if key not in definitions[SHARED_TAG]:
        definitions[SHARED_TAG][key] = value

    return definitions


def _collect_duplicates(node: ManifestType) -> DuplicatesType:
    """
    Traverse the JSON object and collect all potential duplicate values and objects.

    Args:
        node: The JSON object to analyze.

    Returns:
        A dictionary of duplicate objects keyed by content hash.
    """

    def _collect(obj: Dict[str, Any], path: Optional[List[str]] = None) -> None:
        """
        Closure that recursively collects duplicates in the JSON object.

        Args:
            obj: The current object being analyzed.
            path: The current path in the object hierarchy.
        """
        if not isinstance(obj, dict):
            return

        path = [] if path is None else path
        for key, value in obj.items():
            current_path = path + [key]

            if isinstance(value, dict):
                # First process nested dictionaries
                _collect(value, current_path)
                # Then process allowed component tags only
                if _is_allowed_tag(key):
                    _add_duplicate(duplicates, current_path, obj, value)

            # handle primitive types
            elif isinstance(value, (str, int, float, bool)):
                # Process allowed field tags only
                if _is_allowed_tag(key):
                    _add_duplicate(duplicates, current_path, obj, value, key)

            # handle list cases
            elif isinstance(value, list):
                for i, item in enumerate(value):
                    _collect(item, current_path + [str(i)])

    duplicates: DuplicatesType = defaultdict(list)

    try:
        _collect(node)
        return duplicates
    except Exception as e:
        raise ManifestDeduplicationException(str(e)) from e


def _hash_object(node: Dict[str, Any]) -> Optional[str]:
    """
    Create a unique hash for a dictionary object.

    Args:
        node: The dictionary to hash

    Returns:
        A hash string, or None if the value is not hashable
    """
    if isinstance(node, dict):
        # Sort keys to ensure a consistent hash for the same content
        return hashlib.md5(json.dumps(node, sort_keys=True).encode()).hexdigest()
    return None


def _create_reference_key(definitions: DefinitionsType, key: str) -> str:
    """
    Create a unique reference key, handling collisions.

    Args:
        definitions: The definitions dictionary
        key: The base key to use

    Returns:
        A unique reference key
    """
    counter = 1
    while key in definitions[SHARED_TAG]:
        key = f"{key}_{counter}"
        counter += 1
    return key


def _create_ref_object(ref_key: str) -> Dict[str, str]:
    """
    Create a reference object using the specified key.

    Args:
        ref_key: The reference key to use

    Returns:
        A reference object in the proper format
    """
    return {"$ref": f"#/{DEF_TAG}/{SHARED_TAG}/{ref_key}"}
```
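To make the flow concrete, here is a minimal standalone sketch of the technique the module implements, not the module itself: hash each candidate value, move values seen at least twice under `definitions.shared`, and replace every occurrence with a `$ref` pointer. The manifest content (`stream_a`, `stream_b`, the `url_base` URL) is hypothetical, and the sketch hardcodes a single tag with no recursion or key-collision handling, both of which the real module provides.

```python
import hashlib
import json
from collections import defaultdict

def _hash_value(value):
    # Sort keys so identical dicts hash identically regardless of key order
    return hashlib.md5(json.dumps(value, sort_keys=True).encode()).hexdigest()

# Hypothetical mini-manifest: two streams sharing the same url_base
definitions = {
    "stream_a": {"url_base": "https://api.example.com/v1"},
    "stream_b": {"url_base": "https://api.example.com/v1"},
}

# Collect occurrences of the allowed tag, keyed by content hash
occurrences = defaultdict(list)
for stream in definitions.values():
    value = stream.get("url_base")
    if value is not None:
        occurrences[_hash_value({"url_base": value})].append(stream)

# Extract values seen at least twice into definitions["shared"] and re-link
definitions["shared"] = {}
for parents in occurrences.values():
    if len(parents) < 2:
        continue
    definitions["shared"]["url_base"] = parents[0]["url_base"]
    for parent in parents:
        parent["url_base"] = {"$ref": "#/definitions/shared/url_base"}

print(definitions["stream_a"]["url_base"])
# → {'$ref': '#/definitions/shared/url_base'}
```

After the pass, both streams point at the single shared definition, which is why the module can safely take the value from the first occurrence: identical hashes imply identical serialized content.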