Commit 14484f1

Merge branch 'main' into ishamehramixpanel-patch-5
2 parents 5ba55df + a5fecb3 commit 14484f1

36 files changed

+910
-440
lines changed

.github/workflows/cspell.yaml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -10,7 +10,7 @@ jobs:
1010
runs-on: ubuntu-latest
1111
steps:
1212
- uses: actions/checkout@v4
13-
- uses: streetsidesoftware/cspell-action@v6
13+
- uses: streetsidesoftware/cspell-action@v7
1414
with:
1515
# Define glob patterns to filter the files to be checked. Use a new line between patterns to define multiple patterns.
1616
# The default is to check ALL files that were changed in the pull_request or push.
Lines changed: 62 additions & 16 deletions
Original file line numberDiff line numberDiff line change
@@ -1,7 +1,28 @@
1-
Event deduplication allows a project to send the same exact event while only recording that event once.
2-
Deduplication only occurs when a subset of the event data is exactly identical.
1+
Mixpanel provides an event deduplication mechanism to ensure that duplicate events do not skew your analytics. Deduplication is essential when events may be sent multiple times due to network retries, client-side batching, or integration with multiple data sources.
2+
3+
<br />
4+
5+
## How Deduplication Works
6+
7+
Mixpanel deduplicates events using a combination of four key event properties:
8+
9+
- Event Name (`event`)
10+
- Distinct ID (`distinct_id`)
11+
- Timestamp (`time`)
12+
- Insert ID (`$insert_id`)
13+
14+
If all four of these properties are identical across two or more events, Mixpanel considers them duplicates and will only show the most recent version of that event in your reports. This applies regardless of whether the events are sent via SDKs, APIs, or other integrations.
15+
16+
The `$insert_id` should be a randomly generated, unique value for each event to ensure proper deduplication. If `$insert_id` values are reused across distinct events, those events may be unintentionally deduplicated.
17+
18+
Only the four key event properties listed above are used for deduplication. Additional event properties are not considered. For example, if two events share the same Event Name, Distinct ID, Timestamp, and Insert ID but have different `$city` values, they are still considered duplicates.
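The rule above can be sketched in a few lines of Python. This is only an illustration of the comparison described in the text, not Mixpanel's implementation; the helper name `dedup_key` and the `$city` values are assumptions made for the example.

```python
def dedup_key(event: dict) -> tuple:
    """Return the four values the text names as the deduplication key."""
    props = event["properties"]
    return (
        event["event"],
        props["distinct_id"],
        props["time"],
        props["$insert_id"],
    )


# Two hypothetical events that differ only in a non-key property ($city).
first = {
    "event": "Item Purchased",
    "properties": {
        "distinct_id": "user123xyz",
        "time": 1601412131000,
        "$insert_id": "88B7hahbaschhhB66cbsg",
        "$city": "Toronto",
    },
}
second = {
    "event": "Item Purchased",
    "properties": {
        "distinct_id": "user123xyz",
        "time": 1601412131000,
        "$insert_id": "88B7hahbaschhhB66cbsg",
        "$city": "Montreal",
    },
}

# The keys match, so under the stated rule these count as duplicates.
is_duplicate = dedup_key(first) == dedup_key(second)
```

Because `$city` is not part of the key, changing it does not make the second event distinct.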
19+
20+
### Deduplication Example
21+
22+
Deduplication occurs when a subset of the event data (event name, distinct_id, timestamp, $insert_id) is identical. Other event properties are not considered.
323

424
**Required [Event Object](doc:data-model#anatomy-of-an-event) attributes**
25+
526
[block:parameters]
627
{
728
"data": {
@@ -13,6 +34,7 @@ Deduplication only occurs when a subset of the event data is exactly identical.
1334
"0-2": "A name for the event. For example, \"Signed up\", or \"Uploaded Photo\".",
1435
"1-0": "**properties**",
1536
"1-1": "<span style=\"font-family: courier\">Object</span></br><span style=\"color: red\">required</span>",
37+
"1-2": "",
1638
"2-0": "**properties.distinct_id**",
1739
"2-1": "<span style=\"font-family: courier\">String</span></br><span style=\"color: red\">required</span>",
1840
"2-2": "The value of `distinct_id` will be treated as a string, and used to uniquely identify a user associated with your event. If you provide a distinct_id property with your events, you can track a given user through funnels and distinguish unique users for retention analyses. You should always send the same distinct_id when an event is triggered by the same user.",
@@ -27,32 +49,56 @@ Deduplication only occurs when a subset of the event data is exactly identical.
2749
"5-2": "A unique UUID tied to exactly one occurrence of an event."
2850
},
2951
"cols": 3,
30-
"rows": 6
52+
"rows": 6,
53+
"align": [
54+
"left",
55+
"left",
56+
"left"
57+
]
3158
}
3259
[/block]
3360

34-
In other words, each event containing an $insert_id is checked for duplication after being minimized to the following shape:
61+
62+
In other words, each event containing an `$insert_id` is checked for duplication after being minimized to the following shape:
3563

3664
```json
3765
{
38-
"event": "Back to Back",
66+
"event": "Item Purchased",
3967
"properties": {
40-
"token": "project_token",
41-
"distinct_id": "aubrey@thesix.views",
68+
"token": "my_project_token",
69+
"distinct_id": "user123xyz",
4270
"time": 1601412131000,
4371
"$insert_id": "88B7hahbaschhhB66cbsg"
4472
},
4573
}
4674
```
4775

48-
If this simplified object is an exact match to any other simplified event it is marked as a duplicate. Ingested events that have been marked as a duplicate will be deleted within 24 hours.
76+
If this minimized event object is an exact match to any other minimized event object, it is marked as a duplicate. Ingested events that have been marked as duplicates will be deduplicated.
4977

50-
If an event is sent to the Ingestion API without an `$insert_id` one will be generated for it. However, it will not qualify for the deduplication process.
78+
If an event is sent to the Ingestion API without an `$insert_id`, one will be generated for it. However, it will not qualify for the deduplication process.
5179

52-
[block:callout]
53-
{
54-
"type": "warning",
55-
"title": "Deduplication does not rewrite data",
56-
"body": "Using $insert_id is only used to prevent duplicate event data. It cannot be used to update, replace, or delete existing events."
57-
}
58-
[/block]
80+
## Deduplication Mechanisms
81+
82+
Mixpanel uses two main deduplication processes:
83+
84+
### Query-Time Deduplication
85+
86+
- When: Happens immediately when you query data in the Mixpanel UI.
87+
- How: If multiple events share the same event name, distinct_id, timestamp, and $insert_id, only the most recently ingested version of the event is shown in reports (based on the API ingestion time). This ensures that duplicate events do not affect your analytics in real time.
88+
- Scope: This deduplication is visible in the Mixpanel UI and reports, but not in raw data exports. Raw event export will contain all data as they were ingested, without any deduplication.
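A minimal sketch of the query-time behavior described above, assuming each stored event carries a hypothetical `ingested_at` field (this field is not part of the documented event shape and only stands in for "API ingestion time"). The function name `query_time_view` is likewise made up for the example. The stored list itself is left untouched, mirroring the point that raw exports still contain every ingested event.

```python
def query_time_view(events: list[dict]) -> list[dict]:
    """Return one event per four-part key, preferring the latest ingestion."""
    latest: dict[tuple, dict] = {}
    for ev in events:
        props = ev["properties"]
        key = (ev["event"], props["distinct_id"], props["time"], props["$insert_id"])
        current = latest.get(key)
        # "ingested_at" is a hypothetical field used only for this sketch.
        if current is None or ev["ingested_at"] >= current["ingested_at"]:
            latest[key] = ev
    return list(latest.values())


# Two stored copies of the same event, ingested at different moments.
stored = [
    {
        "event": "Item Purchased",
        "ingested_at": 100,
        "properties": {"distinct_id": "user123xyz", "time": 1601412131000,
                       "$insert_id": "abc", "price": 10},
    },
    {
        "event": "Item Purchased",
        "ingested_at": 200,
        "properties": {"distinct_id": "user123xyz", "time": 1601412131000,
                       "$insert_id": "abc", "price": 12},
    },
]

visible = query_time_view(stored)
```

Applying the same grouping to exported data is one way to follow the recommendation to deduplicate raw exports yourself.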
89+
90+
### Compaction-Time Deduplication
91+
92+
- When: Runs periodically in the backend, typically after a few hours and again after about 20 days, once data ingestion for a day is complete.
93+
- How: During compaction, Mixpanel scans for events with the same event name, distinct_id, and $insert_id (timestamp does not need to match exactly, just the same calendar day). The older event is deleted, and only the latest remains in storage.
94+
- Scope: This process helps reduce storage of duplicate events and may affect event counts if duplicates were present with different timestamps.
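The compaction rule differs from the query-time rule in that the timestamp only needs to fall on the same calendar day. The sketch below illustrates that difference under several assumptions that the text does not confirm: `time` is in milliseconds, the calendar day is taken in UTC, and list order stands in for ingestion order so that the last copy wins. The names `compaction_key` and `compact` are hypothetical.

```python
from datetime import datetime, timezone


def compaction_key(event: dict) -> tuple:
    """Key on event name, distinct_id, $insert_id, and the calendar day of `time`."""
    props = event["properties"]
    # Assumption: `time` is milliseconds since the epoch and the day is UTC.
    day = datetime.fromtimestamp(props["time"] / 1000, tz=timezone.utc).date()
    return (event["event"], props["distinct_id"], props["$insert_id"], day)


def compact(events: list[dict]) -> list[dict]:
    """Keep only the last event seen for each compaction key."""
    kept: dict[tuple, dict] = {}
    for ev in events:  # assumption: list order reflects ingestion order
        kept[compaction_key(ev)] = ev
    return list(kept.values())


def make(time_ms: int) -> dict:
    """Build a hypothetical event that varies only in its timestamp."""
    return {
        "event": "Item Purchased",
        "properties": {"distinct_id": "user123xyz", "time": time_ms, "$insert_id": "abc"},
    }


base = 1601412131000
same_day = [make(base), make(base + 60_000)]        # one minute apart
next_day = [make(base), make(base + 86_400_000)]    # 24 hours apart

after_same_day = compact(same_day)
after_next_day = compact(next_day)
```

In this sketch the two same-day events collapse into one even though their timestamps differ, which is why compaction can change event counts that query-time deduplication left alone.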
95+
96+
<br />
97+
98+
## Important Notes
99+
100+
**Raw Event Export** - Deduplication is not applied to raw data exports. If you export events via the API, you may see duplicates. It is recommended to apply the same deduplication logic (event name, distinct_id, timestamp, $insert_id) to your exported data.
101+
102+
**Insert ID Best Practice** - Always generate a unique $insert_id for each event. Reusing $insert_id (e.g., setting it to the user’s distinct_id) can cause unintended deduplication and data loss.
103+
104+
**Deduplication Timing** - Query-time deduplication is immediate. Compaction-time deduplication timing is not guaranteed and may take hours to days to complete.
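As an illustration of the Insert ID best practice, the sketch below assigns a random UUID when the event is created rather than when it is sent, so a network retry of the same payload reuses the same `$insert_id` while separate events never share one. The helper `new_event` is hypothetical, and using `uuid4().hex` is one reasonable choice rather than a documented requirement.

```python
import uuid


def new_event(name: str, distinct_id: str, time_ms: int, **extra) -> dict:
    """Build an event payload with a fresh random $insert_id."""
    return {
        "event": name,
        "properties": {
            "distinct_id": distinct_id,
            "time": time_ms,
            # Generated once per event occurrence; resend this same dict on retry.
            "$insert_id": uuid.uuid4().hex,
            **extra,
        },
    }


purchase_a = new_event("Item Purchased", "user123xyz", 1601412131000)
purchase_b = new_event("Item Purchased", "user123xyz", 1601412131000)
```

Here `purchase_a` and `purchase_b` represent two real purchases at the same timestamp, and their differing `$insert_id` values keep them from being merged.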

openapi/src/ingestion.openapi.yaml

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -101,7 +101,7 @@ paths:
101101
time:
102102
type: integer
103103
title: time
104-
description: The time at which the event occurred, in seconds or milliseconds since UTC epoch.
104+
description: The time at which the event occurred, in seconds or milliseconds since UTC epoch. If the time value is set in the future, it will be overwritten with the current time at ingestion.
105105
distinct_id:
106106
type: string
107107
title: distinct_id
@@ -163,7 +163,7 @@ paths:
163163
time:
164164
type: integer
165165
title: time
166-
description: The time at which the event occurred, in seconds or milliseconds since UTC epoch.
166+
description: The time at which the event occurred, in seconds or milliseconds since UTC epoch. If the time value is set in the future, it will be overwritten with the current time at ingestion.
167167
distinct_id:
168168
type: string
169169
title: distinct_id
