Commit 2eb408b

Updates to cleaning data
1 parent fae791a commit 2eb408b

7 files changed: 8 additions and 13 deletions.

learn-pr/azure/building-end-to-end-data-governance-master-data-stack-with-microsoft-purview-cluedin/includes/10-clean-data-quality-issues.md

Lines changed: 8 additions & 13 deletions
@@ -6,30 +6,25 @@ For this, we'll be using the CluedIn data steward tool: CluedIn Clean.

 :::image type="content" source="../media/clean.png" alt-text="Screenshot of the CluedIn preparation window.":::

-1. Select **Create Project** and choose what records and columns you want to clean.
+1. Select **Create Project** and choose the entity type. For this example, we'll choose **Employee**.

-1. For the filter, choose **Entity Type equals Employee**.
+1. In the new cleaning project, select **Create new clean project**.

-1. In the new cleaning project, select **Generate Project**.
+1. Once this is finished, select the new clean project from the menu, which will launch a studio in a new tab with your employee records.

-1. Once this is finished you'll received a link to "Clean" the data. Selecting the **Clean** button will launch a studio in a new tab with your 10 Employee records. You'll see a column name for the Origin Entity Code and the person.Job column.
+1. On the employee.job header, select the drop-down, select **Facet**, and select **Text Facet**.

->[!NOTE]
-> Don't delete the column name for the Origin Entity Code as it is the reference of what to save these records back to.
+:::image type="content" source="../media/clean-text-facet.png" alt-text="Screenshot of the ClueIn project window, showing the header dropdown with facet and text facet selected.":::

-1. On the person.job header, select the drop-down, select **Facet**, and select **Text Facet**.
-
-:::image type="content" source="../media/Clean_Text_Facet.png" alt-text="Screenshot of the ClueIn project window, showing the header dropdown with facet and text facet selected.":::
-
-1. On the left hand side you'll see that CluedIn shows an aggregation of all of the unique values of that column and then a count next to each item to reflect how many rows share a column value.
+1. On the left hand side, you'll see that CluedIn shows an aggregation of all of the unique values of that column and then a count next to each item to reflect how many rows share a column value.

 1. Select the **Cluster** button. CluedIn will show a prompt that will suggest where the data quality issues lie, and the proposed solution on what to normalize the values to.

 1. From the dropdown, choose the **Keying function** option. Choose the **metaphone3** option in the subsequent dropdown.

 You'll notice that CluedIn is recommending that all of the different spelling of Accounting on the left and proposing that they're all normalized into **Accounting** on the right. Accept this suggestion and the one for Software Dev.

-:::image type="content" source="../media/CleaN_Keying_Function.png" alt-text="Screenshots of the Cluster & Edit column person.job page.":::
+:::image type="content" source="../media/clean-keying-function.png" alt-text="Screenshots of the Cluster & Edit column person.job page.":::

 1. Cycle through all the other Keying functions and their suggestions until all the values are normalized and there are now only two permutations of the Job titles that we had in the original raw data.

@@ -47,7 +42,7 @@ This exercise above has yielded a few elements, including:

 1. Go back to the automated rules that were constructed and select them all and toggle to activate them.

-:::image type="content" source="../media/Rules_Created.png" alt-text="Screenshot of the rules in CluedIn that can be toggled to be activated.":::
+:::image type="content" source="../media/rules-created.png" alt-text="Screenshot of the rules in CluedIn that can be toggled to be activated.":::

 1. Return to the data sources in CluedIn, and map the final file called ContactsAddLater.csv that had the same data quality issues in it, but this time, just process the data directly and don't clean it at all.
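
For readers of this change, the text facet and keying-function steps in the updated instructions can be pictured with a small Python sketch. It is an illustration only, not CluedIn's implementation: `crude_key` is a hypothetical stand-in for a phonetic key such as metaphone3 (it strips non-letters, collapses doubled letters, and drops vowels after the first letter), and the sample job values are invented.

```python
import re
from collections import Counter, defaultdict


def text_facet(values):
    """Count how many rows share each distinct value, as a text facet does."""
    return Counter(values)


def crude_key(value):
    """Hypothetical keying function; a rough stand-in, NOT metaphone3."""
    letters = re.sub(r"[^a-z]", "", value.lower())
    # Collapse runs of the same letter, e.g. "cc" -> "c".
    collapsed = re.sub(r"(.)\1+", r"\1", letters)
    if not collapsed:
        return ""
    # Keep the first letter and drop vowels from the remainder.
    return collapsed[0] + re.sub(r"[aeiou]", "", collapsed[1:])


def cluster(values, key_func=crude_key):
    """Group values whose keys collide and propose the most common spelling."""
    facet = text_facet(values)
    groups = defaultdict(list)
    for value in facet:
        groups[key_func(value)].append(value)
    suggestions = {}
    for members in groups.values():
        if len(members) > 1:
            target = max(members, key=lambda v: facet[v])
            for member in members:
                suggestions[member] = target
    return suggestions


# Invented sample data for the job column.
jobs = [
    "Accounting", "Accounting", "Acounting", "accounting ",
    "Software Dev", "Software Dev", "Softwear Dev",
]

suggestions = cluster(jobs)
normalized = [suggestions.get(job, job) for job in jobs]
print(text_facet(jobs))
print(sorted(set(normalized)))
```

Accepting every suggestion leaves only two distinct job values in this sample, which mirrors the outcome the final clustering step describes. A real phonetic key would catch more variants than this consonant skeleton does, which is why the instructions suggest cycling through several keying functions.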
