Skip to content

Commit 36d0930

Browse files
authored
updates to de-duplication docs (#1644)
1. Fix minor typo ("point to remember"). 2. Correct reference to de-duplication service instead of api. 3. Consistently use de-duplicat(e|ion|ed|ing). It does sound a bit repetitive all formaally spelled out, so I'd be fine removing these changes actually. 4. api -> API
1 parent 9831d23 commit 36d0930

File tree

1 file changed

+3
-3
lines changed

1 file changed

+3
-3
lines changed

src/guides/duplicate-data.md

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -2,10 +2,10 @@
22
title: How does Segment handle duplicate data?
33
---
44

5-
Segment has a special de-duplication service that sits just behind the `api.segment.com` endpoint, and attempts to drop duplicate data. However, that de-duplication api has to hold the entire set of events in memory in order to know whether or not it has seen that event already. Segment stores 24 hours worth of event `message_id`s. This means Segment can de-duplicate any data that appears within a 24 hour rolling window.
5+
Segment has a special de-duplication service that sits just behind the `api.segment.com` endpoint, and attempts to drop duplicate data. However, that de-duplication service has to hold the entire set of events in memory in order to know whether or not it has seen that event already. Segment stores 24 hours worth of event `message_id`s. This means Segment can de-duplicate any data that appears within a 24 hour rolling window.
66

7-
An important point remember is that Segment de-duplicates on the event's `message_id`, _not_ on the contents of an event payload. So if you aren't generating `message_id`s for each event, or are trying to deduplicate data over a longer period than 24 hours, Segment does not have a built-in way to de-duplicate data.
7+
An important point to remember is that Segment de-duplicates on the event's `message_id`, _not_ on the contents of an event payload. So if you aren't generating `message_id`s for each event, or are trying to de-duplicate data over a longer period than 24 hours, Segment does not have a built-in way to de-duplicate data.
88

9-
Since the api layer is de-duping during this window, duplicate events that are further than 24 hours apart from one another must be de-duped in the Warehouse. Segment also dedupes messages going into a Warehouse based on the `message_id`, which is the `id` column in a Segment Warehouse. Note that in these cases you will see duplications in end tools as there is no additional layer prior to sending the event to downstream tools.
9+
Since the API layer is de-duplicating during this window, duplicate events that are further than 24 hours apart from one another must be de-duplicated in the Warehouse. Segment also de-duplicates messages going into a Warehouse based on the `message_id`, which is the `id` column in a Segment Warehouse. Note that in these cases you will see duplications in end tools as there is no additional layer prior to sending the event to downstream tools.
1010

1111
Keep in mind that Segment's libraries all generate `message_id`s for you for each event payload, with the exception of the Segment HTTP API, which assigns each event a unique `message_id` when the message is ingested. You can override these default generated IDs and manually assign a `message_id` if necessary.

0 commit comments

Comments
 (0)