Commit d347c11

Adding dedupe example
1 parent 8e8cd08 commit d347c11

4 files changed: +12 -2 lines changed

articles/data-factory/data-flow-aggregate.md

Lines changed: 11 additions & 1 deletion
@@ -41,6 +41,16 @@ Aggregate transformations are similar to SQL aggregate select queries. Columns t
4141
* Use an aggregate function such as `last()` or `first()` to include that additional column.
4242
* Rejoin the columns to your output stream using the [self join pattern](https://mssqldude.wordpress.com/2018/12/20/adf-data-flows-self-join/).
4343

44+
## Removing duplicate rows
45+
46+
A common use of the aggregate transformation is removing or identifying duplicate entries in source data. This process is known as deduplication. Based upon a set of group by keys, use a heuristic of your choosing to determine which duplicate row to keep. Common heuristics are `first()`, `last()`, `max()`, and `min()`. Use [column patterns](concepts-data-flow-column-pattern.md) to apply the rule to every column except for the group by columns.
47+
48+
![Deduplication](media/data-flow/agg-dedupe.png "Deduplication")
49+
50+
In the above example, columns `ProductID` and `Name` are being used for grouping. If two rows have the same values for those two columns, they're considered duplicates. In this aggregate transformation, the values of the first row matched will be kept and all others will be dropped. Using column pattern syntax, all columns whose names aren't `ProductID` and `Name` are mapped to their existing column name and given the value of the first matched row.
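
The keep-the-first-row behavior described above can be illustrated outside of data flows with a minimal Python sketch. This is not data flow script and not how the service runs the transformation. The `dedupe_keep_first` helper, the `Color` column, and the sample rows are hypothetical, and "first" is assumed to mean the order of the input list.

```python
def dedupe_keep_first(rows, keys):
    """Keep the first row seen for each combination of group-by key values."""
    kept = {}
    for row in rows:
        group = tuple(row[k] for k in keys)
        # setdefault stores a row only the first time its group is seen,
        # which mimics applying first() to every non-key column.
        kept.setdefault(group, row)
    return list(kept.values())


# Hypothetical sample data: the first two rows share ProductID and Name.
rows = [
    {"ProductID": 1, "Name": "Widget", "Color": "Red"},
    {"ProductID": 1, "Name": "Widget", "Color": "Blue"},
    {"ProductID": 2, "Name": "Gadget", "Color": "Green"},
]

deduped = dedupe_keep_first(rows, ["ProductID", "Name"])
```

In this sketch, the second `Widget` row is dropped because its `ProductID` and `Name` match an earlier row. Swapping the heuristic for `last()` would correspond to overwriting the stored row on every match instead of keeping the first one.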
51+
52+
For data validation scenarios, the `count()` function can be used to count how many duplicates there are.
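
As a rough illustration of the validation idea, the following Python sketch counts rows per group and reports the groups that occur more than once. It is an analogy for grouping with `count()`, not data flow script. The `count_per_group` helper and the sample rows are assumptions made for the example.

```python
from collections import Counter


def count_per_group(rows, keys):
    """Count how many rows share each combination of group-by key values."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


# Hypothetical sample data: ProductID 1 / Widget appears twice.
sample_rows = [
    {"ProductID": 1, "Name": "Widget"},
    {"ProductID": 1, "Name": "Widget"},
    {"ProductID": 2, "Name": "Gadget"},
]

counts = count_per_group(sample_rows, ["ProductID", "Name"])
# Groups with a count above 1 contain duplicates.
duplicates = {group: n for group, n in counts.items() if n > 1}
```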
53+
4454
## Data flow script
4555

4656
### Syntax
@@ -77,7 +87,7 @@ The data flow script for this transformation is in the snippet below.
7787
```
7888
MoviesYear aggregate(
7989
groupBy(year),
80-
avgrating = avg(toInteger(Rating))
90+
avgrating = avg(toInteger(Rating))
8191
) ~> AvgComedyRatingByYear
8292
```
8393

articles/data-factory/data-flow-lookup.md

Lines changed: 1 addition & 1 deletion
@@ -70,7 +70,7 @@ Enabling broadcasting pushes the entire dataset into memory. For smaller dataset
7070
```
7171
### Example
7272

73-
![Lookup Transformation](media/data-flow/lookup1.png "Lookup")
73+
![Lookup Transformation](media/data-flow/lookup-dsl-example.png "Lookup")
7474

7575
The data flow script for the above lookup configuration is in the code snippet below.
7676
