
Commit 16ae96a

further docs tweaks, moved tradeoffs and considerations to the bottom

1 parent 42b257d commit 16ae96a

File tree

2 files changed: +45 −20 lines changed

frontend/docs/docs/user-guide/collection.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,6 +1,6 @@
 # Intro to Collections
 
-A collection is a specific, user-directed grouping of archived items from either crawls or WACZ files. You can create a collection, add content to your collection, include a description to your collection, download your collection, and share your collection whichever way you need to others in your community.
+A collection is a specific, user-directed grouping of either crawls or uploaded WACZ files, both [archived items](./archived-items.md). You can create a collection, add content and a description to it, download it, and share it with others in your community however you need.
 
 ## Create a Collection
```

frontend/docs/docs/user-guide/deduplication.md

Lines changed: 44 additions & 19 deletions
```diff
@@ -2,33 +2,21 @@
 
 ## Overview
 
-Deduplication (or “dedupe”) is the process of preventing duplicate content from being stored during crawling. When deduplication is enabled, the crawler will reference a collection’s existing items when checking for new content and URLs. Content that is identical, even when found at a different URL, will be deduplicated by writing "revisit" records rather than the full resource in the resulting crawl WACZ files. This results in a smaller, space-saving collection and smaller archived items.
+Deduplication (or “dedupe”) is the process of preventing duplicate content from being stored during crawling. In Browsertrix, deduplication is facilitated through [collections](./collection.md), which allow arbitrary grouping of crawled content as needed.
 
-Deduplication in Browsertrix is facilitated by a _deduplication index_ associated with a given collection, which contains information for every resource and URL in the collection’s archived items.
+After deduplication is enabled on a collection, a _deduplication index_ is created for the collection, containing the unique content hashes for every URL in every archived item in the collection.
 
-## Tradeoffs and Considerations
-
-While deduplication can help save storage space, the process also creates dependencies between items. Because content for a given page may be spread throughout multiple crawls, one crawled item may depend on another for its skipped and omitted content. To view the complete, deduplicated content of a crawled site, the entire collection must be replayed or downloaded.
-
-For individual crawls, replay may not work as expected after downloading the crawl unless you select the option to download the crawl with dependencies. This option will bundle the crawl WACZ with all of the WACZ files from other archived items in the collection that the crawl’s deduplicated resources depend on as a single combined WACZ file.
-
-Because content for a given page may be spread throughout multiple crawls, deleting crawls in a deduplicated collection will also result in replay not working as expected for some pages and resources.
+When running a crawl with deduplication, the crawler will check the designated collection’s deduplication index for each new URL discovered during the crawl. Content that is identical, even when found at a different URL, will be deduplicated by writing "revisit" records rather than the full resource in the resulting crawl WACZ files. This results in smaller, space-saving WACZ files and therefore smaller collections and crawls.
 
-Deduplication may be more appropriate for some users of Browsertrix than others. It might be a good fit for you if you:
-
-- Keep your web archives primarily within Browsertrix, utilizing the collection sharing features to provide access; or if you
-- Regularly export your archived items from Browsertrix and add their contents to playback systems where they are all replayed together, such as large web archive collections powered by wayback machine-style software such as [pywb](https://github.com/webrecorder/pywb).
+## Enable Deduplication
 
-Deduplication might not be the best fit for you if you:
+Deduplication can be enabled in two ways: from a crawl workflow or from an existing collection.
 
-- Regularly download your archived items as WACZ files to store and share as discrete files that will be replayed independently from each other, such as in a digital preservation repository or digital library that uses ReplayWeb.page as a web archive viewer. You may find that the need to download each crawl as a combined WACZ file with all dependencies from other items included for replay outside of Browsertrix negates the storage savings that would otherwise be gained from using deduplication.
 
-!!! tip "Tip: Removing Items from a Collection with Deduplication"
-    Crawls in a deduplicated collection that are deleted will not have the URLs encountered during those crawls removed from the collection’s deduplication index by default. Org admins are able to prune a collection’s deduplication index down to only its current items by clicking _Purge Index_ in the **Deduplication** tab of the collection, or in the **Deduplication** section of [Org Settings](./org-settings.md). This will start a new job to rebuild the index without the removed items.
+_From Crawl Workflow_: To enable deduplication, specify a collection to use as the deduplication source in the workflow’s **Deduplication** section when creating or editing a workflow. The first time a crawl workflow is run with deduplication enabled, a deduplication index will be created for the collection and available to view in the collection’s **Deduplication** tab. The workflow UI also allows naming a new collection, which will be created for deduplication once the crawl workflow is saved and before the crawl starts.
 
-## Enable Deduplication
 
-To enable deduplication, specify a collection to use as the deduplication source when creating or editing a workflow. The first time a crawl workflow is run with deduplication enabled, a deduplication index will be created for the collection and available to view in the collection’s **Deduplication** tab. It is also possible to create the deduplication index of the collection’s archived items before running crawl workflows from the collection’s **Deduplication** tab by selecting the _Create Dedupe Index_ button, which is visible if the index does not yet exist.
+_From Collection_: It is also possible to create the deduplication index of the collection’s archived items before running any crawl workflows, by selecting the _Create Dedupe Index_ button in the collection’s **Deduplication** tab; the button is visible if the index does not yet exist. This is useful for creating an index on an existing collection.
 
 Building the deduplication index may take some time, especially for collections that already contain a large number of archived items, as Browsertrix will index each URL from all items in the collection.
```
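The hash-based check described in the overview above can be sketched in a few lines. This is an illustrative toy model, not Browsertrix's actual data model or API: the `DedupIndex` class, its `record` method, and the dictionary-shaped records are all hypothetical stand-ins for the crawler's real WARC response/revisit logic.

```python
import hashlib

class DedupIndex:
    """Toy model of a collection's deduplication index: content hashes
    seen so far, mapped to the URL and crawl where the content was first
    stored. (Hypothetical sketch -- not Browsertrix's real data model.)"""

    def __init__(self):
        self._seen = {}  # sha256 hex digest -> (url, crawl_id)

    def record(self, url, body, crawl_id):
        """Return a full 'response' record for new content, or a small
        'revisit' record pointing at the original copy for duplicates."""
        digest = hashlib.sha256(body).hexdigest()
        if digest in self._seen:
            orig_url, orig_crawl = self._seen[digest]
            return {"type": "revisit", "url": url,
                    "refers_to": orig_url, "in_crawl": orig_crawl}
        self._seen[digest] = (url, crawl_id)
        return {"type": "response", "url": url, "payload": body}

index = DedupIndex()
# Identical content at two different URLs, crawled by two different crawls:
first = index.record("https://example.com/logo.png", b"png-bytes", "crawl-1")
second = index.record("https://example.com/img/logo.png", b"png-bytes", "crawl-2")
print(first["type"], second["type"])  # response revisit
```

Note how the second crawl stores only a pointer back to crawl-1's copy, which is why deduplicated WACZ files are smaller but also depend on earlier crawls for replay.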
```diff
@@ -48,6 +36,43 @@ If you already have existing archived items and crawl workflows in your org, be
 
 Finally, have an org admin set the collection as the default deduplication source in the **Crawling Defaults** section of [Org Settings](./org-settings.md). Once this is done, all new crawl workflows will have deduplication enabled, and all crawls run with these workflows will be deduplicated against the collection containing all of the items in your org.
 
+## Tradeoffs and Considerations
+
+### Dependencies between Crawls
+
+While deduplication can help save storage space, the process also creates dependencies between different crawls. Without deduplication, each crawl’s WACZ files are independent of any other crawl. With deduplication, since the previously crawled content for a given page may be spread throughout multiple crawls, a crawl may depend on the WACZ files of one or more previous crawls for its skipped and omitted content.
+
+Browsertrix tracks these dependencies in the collection’s **Deduplication** section, and all dependencies (previous crawls) are also added to the same collection.
+
+To view the complete, deduplicated content of a crawled site, more than one archived item may need to be loaded. By default, the collection should already have all the dependencies needed for replay (unless they were manually removed). For individual crawls, Browsertrix will automatically pull in the required dependent crawls to make replay work.
+
+### Downloading Deduplicated Crawls
+
+Downloading individual WACZ files for deduplicated crawls includes only incremental, new data (as that is the intention of deduplication, after all!), on the assumption that the user already has the duplicate data elsewhere.
+
+To ensure all necessary data for replay is included, be sure to select **Export as Combined WACZ With Dependencies** from the WACZ files page. This option will bundle the crawl WACZ together with any additional WACZ files from other archived items in the collection that the crawl’s deduplicated resources depend on, as a single combined WACZ file.
+
+### Deleting Deduplicated Crawls
+
+Because content for a given page may be spread throughout multiple crawls, deleting crawls in a deduplicated collection will also result in replay not working as expected for some pages and resources.
+
+!!! tip "Tip: Deleting Items from a Collection with Deduplication"
+    Crawls that are deleted or removed from a collection with deduplication enabled are not automatically removed from the collection’s deduplication index. This allows future crawls to still deduplicate against the index without having to store the full crawl data in Browsertrix, which may be the desired behavior for incremental crawling. Org admins are able to prune a collection’s deduplication index down to only its current items by clicking _Purge Index_ in the **Deduplication** tab of the collection, or in the **Deduplication** section of [Org Settings](./org-settings.md). This will start a new job to rebuild the index without the removed items.
+
+### Deduplication Use Cases
+
+Deduplication may be more appropriate for some users of Browsertrix than others. It might be a good fit for you if you:
+
+- Keep your web archives primarily within Browsertrix, utilizing the collection sharing features to provide access; or if you
+- Regularly export your archived items from Browsertrix and add their contents to playback systems where they are all replayed together, such as large web archive collections powered by wayback machine-style software such as [pywb](https://github.com/webrecorder/pywb).
+
+Deduplication might not be the best fit for you if you:
+
+- Regularly download your archived items as single WACZ files to store and share as discrete files that will be replayed independently of each other, such as in a digital preservation repository or digital library that uses ReplayWeb.page as a web archive viewer. You may find that the need to download each crawl as a combined WACZ file with all dependencies from other items included for replay outside of Browsertrix negates the storage savings that would otherwise be gained from using deduplication.
+
 ## Technical Details
 
 More information about how deduplication is implemented in Browsertrix Crawler is available in the [crawler documentation](https://crawler.docs.browsertrix.com/user-guide/dedupe/).
```
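The "Dependencies between Crawls" and combined-WACZ export behavior added in this diff amount to a transitive walk over a crawl-dependency graph. Below is a minimal sketch of that idea; the `deps` mapping and function name are hypothetical illustrations, not Browsertrix's actual API or storage layout.

```python
def collect_dependencies(crawl_id, deps):
    """Transitively collect every crawl that a deduplicated crawl's
    revisit records depend on, so all of them can be bundled together
    (as in a combined WACZ export) for standalone replay.
    `deps` maps a crawl to the crawls it directly references
    (hypothetical data layout, for illustration only)."""
    needed, stack = set(), [crawl_id]
    while stack:
        current = stack.pop()
        if current in needed:
            continue  # already visited; avoids cycles and rework
        needed.add(current)
        stack.extend(deps.get(current, []))
    return needed

# crawl-3 revisits content first stored in crawl-1 and crawl-2;
# crawl-2 in turn revisits content first stored in crawl-1.
deps = {"crawl-3": ["crawl-1", "crawl-2"], "crawl-2": ["crawl-1"]}
print(sorted(collect_dependencies("crawl-3", deps)))
# ['crawl-1', 'crawl-2', 'crawl-3']
```

This is also why deleting crawl-1 from the collection would break replay of pages in crawl-2 and crawl-3: their revisit records would point at data that no longer exists.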
