Commit ea06329

Merge pull request #108821 from djpmsft/docUpdates
Adding deduplication example
2 parents c14cb46 + 8653418 commit ea06329

4 files changed: 13 additions & 3 deletions

articles/data-factory/data-flow-aggregate.md

Lines changed: 12 additions & 2 deletions
@@ -7,7 +7,7 @@ ms.reviewer: daperlov
 ms.service: data-factory
 ms.topic: conceptual
 ms.custom: seo-lt-2019
-ms.date: 10/15/2019
+ms.date: 03/24/2020
 ---

 # Aggregate transformation in mapping data flow
@@ -41,6 +41,16 @@ Aggregate transformations are similar to SQL aggregate select queries. Columns t
 * Use an aggregate function such as `last()` or `first()` to include that additional column.
 * Rejoin the columns to your output stream using the [self join pattern](https://mssqldude.wordpress.com/2018/12/20/adf-data-flows-self-join/).

+## Removing duplicate rows
+
+A common use of the aggregate transformation is removing or identifying duplicate entries in source data. This process is known as deduplication. Based on a set of group by keys, use a heuristic of your choosing to determine which duplicate row to keep. Common heuristics are `first()`, `last()`, `max()`, and `min()`. Use [column patterns](concepts-data-flow-column-pattern.md) to apply the rule to every column except for the group by columns.
+
+![Deduplication](media/data-flow/agg-dedupe.png "Deduplication")
+
+In the above example, columns `ProductID` and `Name` are used for grouping. If two rows have the same values for those two columns, they're considered duplicates. In this aggregate transformation, the values of the first row matched are kept and all others are dropped. Using column pattern syntax, all columns whose names are neither `ProductID` nor `Name` are mapped to their existing column name and given the value of the first matched row. The output schema is the same as the input schema.
+
+For data validation scenarios, the `count()` function can be used to count how many duplicates there are.
+
 ## Data flow script

 ### Syntax
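
The deduplication rule described in the added "Removing duplicate rows" section (group by `ProductID` and `Name`, keep the first matched row, and use a count for validation) can be illustrated outside Data Factory. The following is a minimal Python sketch of those semantics, not the data flow script itself. The helper names `dedupe_keep_first` and `count_duplicates`, the `Color` column, and the sample rows are all hypothetical. It also assumes rows arrive in a fixed order, which a distributed engine may not guarantee.

```python
from collections import Counter


def dedupe_keep_first(rows, keys):
    """Keep the first row seen for each combination of key values."""
    first_by_key = {}
    for row in rows:
        group = tuple(row[k] for k in keys)
        # setdefault stores the row only the first time a group is seen,
        # which mirrors applying first() to every non-key column.
        first_by_key.setdefault(group, row)
    # dicts preserve insertion order, so groups come back in first-seen order
    return list(first_by_key.values())


def count_duplicates(rows, keys):
    """Count rows per key combination, similar to count() per group."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


# Hypothetical sample data; only ProductID and Name come from the text.
rows = [
    {"ProductID": 1, "Name": "Widget", "Color": "Red"},
    {"ProductID": 1, "Name": "Widget", "Color": "Blue"},
    {"ProductID": 2, "Name": "Gadget", "Color": "Green"},
]

deduped = dedupe_keep_first(rows, ["ProductID", "Name"])
counts = count_duplicates(rows, ["ProductID", "Name"])
```

Here `deduped` has the same columns as the input, matching the statement that the output schema is unchanged, and any group in `counts` with a value greater than 1 marks a set of duplicates.
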
@@ -77,7 +87,7 @@ The data flow script for this transformation is in the snippet below.
 ```
 MoviesYear aggregate(
 groupBy(year),
-avgrating = avg(toInteger(Rating))
+avgrating = avg(toInteger(Rating))
 ) ~> AvgComedyRatingByYear
 ```

articles/data-factory/data-flow-lookup.md

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ Enabling broadcasting pushes the entire dataset into memory. For smaller dataset
 ```
 ### Example

-![Lookup Transformation](media/data-flow/lookup1.png "Lookup")
+![Lookup Transformation](media/data-flow/lookup-dsl-example.png "Lookup")

 The data flow script for the above lookup configuration is in the code snippet below.

