Commit b66dc1b

- allow configuring feature flag default
- set 'dedupe.default_enabled: true' to enable dedupe by default for self-deploy users
- docs: add info admonitions noting dedupe is in beta, may not yet be available to all users
1 parent 27c50de commit b66dc1b

6 files changed: +31 -2 lines changed

backend/btrixcloud/models.py

Lines changed: 9 additions & 1 deletion
@@ -36,6 +36,8 @@

 from .db import BaseMongoModel

+from .utils import is_bool
+
 # num browsers per crawler instance
 NUM_BROWSERS = int(os.environ.get("NUM_BROWSERS", 2))

@@ -65,6 +67,12 @@
 # Minimum part size for file uploads
 MIN_UPLOAD_PART_SIZE = 10000000

+# enable dedupe by default
+DEDUPE_FEATURE_ENABLED_DEFAULT = is_bool(
+    os.environ.get("DEDUPE_FEATURE_ENABLED_DEFAULT")
+)
+
+
 # annotated types
 # ============================================================================
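
The new constant depends on how the environment string is turned into a boolean. The `is_bool` helper lives in `.utils` and its body is not part of this commit, so the following is only a minimal sketch under that assumption: `parse_bool` is a hypothetical stand-in, and the set of accepted spellings is a guess. It illustrates why a plain `bool()` cast would not be enough for a value that the ConfigMap delivers as a string such as "true" or "false".

```python
import os
from typing import Optional


# Hypothetical stand-in for `is_bool` from `.utils`; the real helper's
# accepted spellings are not shown in this commit.
def parse_bool(value: Optional[str]) -> bool:
    """Return True only for common truthy spellings; None or '' is False."""
    if not value:
        return False
    return value.strip().lower() in ("true", "1", "yes", "on")


# Any non-empty string is truthy in Python, so bool() cannot be used directly.
naive_result = bool("false")

# Simulate the string the ConfigMap would inject into the environment.
os.environ["DEDUPE_FEATURE_ENABLED_DEFAULT"] = "true"
DEDUPE_FEATURE_ENABLED_DEFAULT = parse_bool(
    os.environ.get("DEDUPE_FEATURE_ENABLED_DEFAULT")
)
```

With a helper of this shape, an unset variable falls back to `False`, which would match the previous hard-coded default.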

@@ -2325,7 +2333,7 @@ class FeatureFlags(ValidatedFeatureFlags):

     dedupeEnabled: bool = Field(
         description="Enable deduplication options for an org. Intended for beta-testing dedupe.",
-        default=False,
+        default=DEDUPE_FEATURE_ENABLED_DEFAULT,
     )

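
The changed `Field` default is read once, when the class body executes, so the deployment-wide value acts as a fallback and an explicitly set per-org flag still takes precedence. The sketch below uses a standard-library dataclass as a simplified stand-in for the pydantic model (`FeatureFlagsSketch` is a hypothetical name), assuming the constant has already been parsed from the environment. How already-stored org documents interact with the new default is not shown in this commit and is not modeled here.

```python
from dataclasses import dataclass

# Assumed to have been parsed from the environment at import time.
DEDUPE_FEATURE_ENABLED_DEFAULT = True


@dataclass
class FeatureFlagsSketch:
    """Simplified, hypothetical stand-in for the FeatureFlags model."""

    # The default is captured when the class is defined.
    dedupeEnabled: bool = DEDUPE_FEATURE_ENABLED_DEFAULT


# No explicit flag: the deployment-wide default applies.
org_default = FeatureFlagsSketch()

# An explicit per-org value overrides the default.
org_opted_out = FeatureFlagsSketch(dedupeEnabled=False)

# Rebinding the constant afterwards does not change the captured default.
DEDUPE_FEATURE_ENABLED_DEFAULT = False
org_created_later = FeatureFlagsSketch()
```

In practice this suggests that a change to `dedupe.default_enabled` only takes effect once the backend process restarts with the updated environment.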

chart/templates/configmap.yaml

Lines changed: 2 additions & 0 deletions
@@ -114,6 +114,8 @@ data:

   ENABLE_AUTO_RESIZE_INDEX_STORAGE: "{{ .Values.dedupe.enable_auto_resize }}"

+  DEDUPE_FEATURE_ENABLED_DEFAULT: "{{ .Values.dedupe.default_enabled }}"
+

   {{- if .Values.available_plans }}
   AVAILABLE_PLANS: {{ .Values.available_plans | toJson }}

chart/values.yaml

Lines changed: 5 additions & 1 deletion
@@ -55,6 +55,7 @@ crawler_extra_args: ""
 # max allowed browser windows per crawl
 max_browser_windows: 8

+
 # Cluster Settings
 # =========================================
 name: browsertrix-cloud
@@ -250,7 +251,7 @@ redis_memory: "200Mi"
 redis_storage: "3Gi"


-# Redis Dedup Index
+# Dedupe Index
 # =========================================
 dedupe:
   backend_type: kvrocks
@@ -261,6 +262,9 @@ dedupe:
   # backend_type: redis
   # dedupe_image: redis

+  # enabled by default without feature flag
+  default_enabled: true
+
   memory: "1Gi"
   cpu: "100m"

frontend/docs/docs/user-guide/deduplication.md

Lines changed: 5 additions & 0 deletions
@@ -1,5 +1,10 @@
 # Deduplication

+!!! info "Deduplication is in Beta"
+
+    As of current release, the feature is still beta and may not be available to all users.
+    If you don't see the options below, consult your admin or reach out to support to request access.
+
 ## Overview

 Deduplication (or “dedupe”) is the process of preventing duplicate content from being stored during crawling. In Browsertrix, deduplication is facilitated through [collections](./collection.md), which allow arbitrary grouping of crawled content as needed.

frontend/docs/docs/user-guide/org-settings.md

Lines changed: 5 additions & 0 deletions
@@ -31,6 +31,11 @@ Set default suggested settings for all new crawl workflows. When creating a new

 ## Deduplication

+!!! info "Deduplication is in Beta"
+
+    As of current release, the feature is still beta and may not be available to all users.
+    If you don't see the options below, consult your admin or reach out to support to request access.
+
 View and manage deduplication indexes for all collections used as [deduplication sources](deduplication.md) in the org. Each entry includes information such as how many archived items and URLs are included in the index and how many deleted archived items are purgeable from the index. From the action menu, purge or delete the deduplication index for each collection.

 <!-- ## Limits

frontend/docs/docs/user-guide/workflow-setup.md

Lines changed: 5 additions & 0 deletions
@@ -427,6 +427,11 @@ Cron schedules are always in [UTC](https://en.wikipedia.org/wiki/Coordinated_Uni

 ## Deduplication

+!!! info "Deduplication is in Beta"
+
+    As of current release, the feature is still beta and may not be available to all users.
+    If you don't see the options below, consult your admin or reach out to support to request access.
+
 Prevent duplicate content from being crawled and stored.

 ### Crawl Deduplication
