learn-pr/azure/building-end-to-end-data-governance-master-data-stack-with-microsoft-purview-cluedin/includes/10-clean-data-quality-issues.md
Lines changed: 8 additions & 13 deletions
@@ -6,30 +6,25 @@ For this, we'll be using the CluedIn data steward tool: CluedIn Clean.
:::image type="content" source="../media/clean.png" alt-text="Screenshot of the CluedIn preparation window.":::
- 1. Select **Create Project** and choose what records and columns you want to clean.
+ 1. Select **Create Project** and choose the entity type. For this example, we'll choose **Employee**.
- 1. For the filter, choose **Entity Type equals Employee**.
+ 1. In the new cleaning project, select **Create new clean project**.
- 1. In the new cleaning project, select **Generate Project**.
+ 1. Once this is finished, select the new clean project from the menu, which will launch a studio in a new tab with your employee records.
- 1. Once this is finished you'll receive a link to "Clean" the data. Selecting the **Clean** button will launch a studio in a new tab with your 10 Employee records. You'll see a column name for the Origin Entity Code and the person.Job column.
+ 1. On the employee.job header, select the drop-down, select **Facet**, and select **Text Facet**.
- >[!NOTE]
- > Don't delete the column name for the Origin Entity Code as it is the reference of what to save these records back to.
+ :::image type="content" source="../media/clean-text-facet.png" alt-text="Screenshot of the CluedIn project window, showing the header dropdown with facet and text facet selected.":::
- 1. On the person.job header, select the drop-down, select **Facet**, and select **Text Facet**.
-
- :::image type="content" source="../media/Clean_Text_Facet.png" alt-text="Screenshot of the CluedIn project window, showing the header dropdown with facet and text facet selected.":::
-
- 1. On the left hand side you'll see that CluedIn shows an aggregation of all of the unique values of that column and then a count next to each item to reflect how many rows share a column value.
+ 1. On the left hand side, you'll see that CluedIn shows an aggregation of all of the unique values of that column and then a count next to each item to reflect how many rows share a column value.
1. Select the **Cluster** button. CluedIn will show a prompt that will suggest where the data quality issues lie, and the proposed solution on what to normalize the values to.
1. From the dropdown, choose the **Keying function** option. Choose the **metaphone3** option in the subsequent dropdown.
You'll notice that CluedIn groups all of the different spellings of Accounting on the left and proposes that they're all normalized into **Accounting** on the right. Accept this suggestion and the one for Software Dev.
- :::image type="content" source="../media/CleaN_Keying_Function.png" alt-text="Screenshots of the Cluster & Edit column person.job page.":::
+ :::image type="content" source="../media/clean-keying-function.png" alt-text="Screenshots of the Cluster & Edit column person.job page.":::
1. Cycle through all the other keying functions and their suggestions until all the values are normalized and only two distinct job titles remain out of the variants in the original raw data.
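The text facet and cluster steps in this hunk can be sketched in plain Python. This is only an illustration of the idea, not CluedIn's implementation: the job values are made up, and `skeleton_key` is a crude stand-in keying function (it is not metaphone3, which is a phonetic algorithm with far more rules). The names `text_facet`, `skeleton_key`, and `cluster` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical job values; the real ones come from the employee.job column.
jobs = [
    "Accounting", "Acounting", "accounting", "Accounting", "Accountting",
    "Software Dev", "Softwar Dev", "Software Dev", "software dev",
]

def text_facet(values):
    """Count how many rows share each distinct value, like a text facet."""
    return Counter(values)

def skeleton_key(value):
    """Crude keying function (assumption, NOT metaphone3): keep letters only,
    lowercase them, drop vowels after the first letter, collapse repeats."""
    letters = [c for c in value.lower() if c.isalpha()]
    kept = [c for i, c in enumerate(letters) if i == 0 or c not in "aeiou"]
    collapsed = []
    for c in kept:
        if not collapsed or collapsed[-1] != c:
            collapsed.append(c)
    return "".join(collapsed)

def cluster(values, key=skeleton_key):
    """Group distinct values whose keys collide and propose the most
    frequent member of each group as the normalization target."""
    facet = text_facet(values)
    groups = defaultdict(list)
    for value in facet:
        groups[key(value)].append(value)
    suggestions = {}
    for members in groups.values():
        if len(members) > 1:
            target = max(members, key=lambda v: facet[v])
            suggestions[target] = sorted(m for m in members if m != target)
    return suggestions

facet = text_facet(jobs)
suggestions = cluster(jobs)
print(facet.most_common())
print(suggestions)
```

Swapping in a different `key` function changes which values collide, which is why cycling through several keying functions can surface variants that one function alone misses.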
@@ -47,7 +42,7 @@ This exercise above has yielded a few elements, including:
1. Go back to the automated rules that were constructed, select them all, and toggle them to activate them.
- :::image type="content" source="../media/Rules_Created.png" alt-text="Screenshot of the rules in CluedIn that can be toggled to be activated.":::
+ :::image type="content" source="../media/rules-created.png" alt-text="Screenshot of the rules in CluedIn that can be toggled to be activated.":::
1. Return to the data sources in CluedIn, and map the final file called ContactsAddLater.csv that had the same data quality issues in it, but this time, just process the data directly and don't clean it at all.
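As a rough illustration of why the later file needs no manual cleaning once the rules are active, the sketch below applies a small rule table to incoming rows. Everything here is an assumption for illustration: the rule values, the `name` and `job` column names, and the in-memory file contents are made up and do not reflect the actual layout of ContactsAddLater.csv or how CluedIn stores its rules.

```python
import csv
import io

# Hypothetical rules of the kind a cleaning session could produce:
# each maps a misspelled value to the value it was normalized to.
rules = {
    "Acounting": "Accounting",
    "accounting": "Accounting",
    "Softwar Dev": "Software Dev",
}

# Made-up stand-in for a later file with the same data quality issues.
later_file = io.StringIO(
    "name,job\n"
    "Ana,Acounting\n"
    "Ben,Software Dev\n"
    "Cy,Softwar Dev\n"
)

def apply_rules(rows, rules, column="job"):
    """Return copies of the rows with `column` normalized where a rule matches."""
    return [{**row, column: rules.get(row[column], row[column])} for row in rows]

processed = apply_rules(csv.DictReader(later_file), rules)
jobs_after = [row["job"] for row in processed]
print(jobs_after)
```

Values with no matching rule pass through unchanged, so only the variants that were resolved during the cleaning session are rewritten.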