Commit a2b9ff2

SuaYoo and emma-sgtw4l authored
docs: Update browser profile user guide (#3034)
- Updates browser profiles user guide and separates into distinct pages for overview, creating, and editing.
- Fixes workflow settings "Proxy" label to match UI.
- Adjusts mkdocs heading styling.

---------

Co-authored-by: Emma Segal-Grossman <[email protected]>
Co-authored-by: Tessa Walsh <[email protected]>
1 parent 5e493a0 commit a2b9ff2

12 files changed

+226
-73
lines changed

frontend/docs/docs/deploy/proxies.md

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@
22

33
Browsertrix can be configured to direct crawling traffic through dedicated proxy servers, allowing websites to be crawled from a specific geographic location regardless of where Browsertrix itself is deployed.
44

5-
The Browsertrix superadmin can configure which proxy servers are available to which organizations or if they are shared across all organizations, and users can [choose from one of the available proxies in each crawl workflow](../user-guide/workflow-setup.md#proxy). Users can also configure the default crawling proxy that will be used for the organization in organization-wide [Crawling Defaults](../user-guide/org-settings.md#crawling-defaults).
5+
The Browsertrix superadmin can configure which proxy servers are available to which organizations or if they are shared across all organizations, and users can [choose from one of the available proxies in each crawl workflow](../user-guide/workflow-setup.md#crawler-proxy-server). Users can also configure the default crawling proxy that will be used for the organization in organization-wide [Crawling Defaults](../user-guide/org-settings.md#crawling-defaults).
66

77
This guide covers how to set up proxy servers for use with Browsertrix, as well as how to configure Browsertrix to make those proxies available to organizations.
88

Lines changed: 3 additions & 0 deletions

frontend/docs/docs/stylesheets/extra.css

Lines changed: 13 additions & 3 deletions
@@ -134,12 +134,22 @@ h5 {
134134
}
135135

136136
.md-typeset h1,
137-
h2,
138-
h3 {
139-
font-weight: 650 !important;
137+
.md-typeset h2,
138+
.md-typeset h3,
139+
.md-typeset h4 {
140+
font-weight: 600;
140141
font-variation-settings: "OPSZ" 35;
141142
}
142143

144+
.md-typeset h3 {
145+
font-size: 1.125em;
146+
}
147+
148+
.md-typeset h4,
149+
.md-typeset h5 {
150+
font-size: 0.875em;
151+
}
152+
143153
.md-typeset {
144154
font-feature-settings:
145155
"ss04" off,

frontend/docs/docs/user-guide/browser-profiles.md

Lines changed: 0 additions & 54 deletions
This file was deleted.
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
# Intro to Browser Profiles
2+
3+
Browser profiles are saved instances of a web browsing session that can be used to configure a website before it is crawled.
4+
5+
## Common Use Cases
6+
7+
### Social Media Sign In
8+
9+
Pre-configure a social media site to be logged in so that the crawler can access content that can only be viewed by logged-in users.
10+
11+
!!! tip "Best Practices: Use an account created specifically for archiving a website"
12+
13+
We recommend creating dedicated accounts for archiving pages that require user registration but are otherwise public. Requiring registration to view otherwise public content is sometimes referred to as a login wall. Login walls are commonly used by social media platforms.
14+
15+
Although dedicated accounts are not required to benefit from browser profiles, they can address the following potential issues:
16+
17+
- While usernames and passwords are never saved by Browsertrix, the private tokens that enable access to logged-in content _are_ stored. Thus, anyone with access to your Browsertrix account, whether authorized or malicious, may be able to access the logged-in content.
18+
19+
- Some websites may rate limit or lock accounts for activity they deem suspicious, such as logging in from a new geographical location or crawling that the site identifies as robot activity.
20+
21+
- Personalized data such as cookies, location, etc. may be included in the resulting crawl.
22+
23+
- The logged-in interface may display unwanted personally identifiable information such as a username or profile picture.
24+
25+
An exception to this practice is if your goal is to archive personalized or private content accessible only from designated accounts. In these instances we recommend changing the account's password after crawling is complete.
26+
27+
### Hide Popup Prompts
28+
29+
Websites may prompt users for a number of reasons before displaying the rest of the page, such as for age verification, informed consent requirements, or geographical location. Configure a browser profile to accept, dismiss, or otherwise hide these dialogs so that the content behind them is visible to the crawler.
Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
1+
# Configure Sites
2+
3+
Websites are configured through a temporary browser that is embedded directly in the Browsertrix interface. Every website that is visited using the embedded browser is added to the list of _Saved Sites_. When the embedded browser session ends, personalized data from the sites are collected into a profile. This profile of preconfigured sites can then be saved and used by multiple [crawl workflows](../crawl-workflows.md).
4+
5+
The embedded browser is used during the process of [creating a new browser profile](./create-browser-profile.md) and when [editing an existing profile](./edit-browser-profile.md).
6+
7+
## Use Cases
8+
9+
### Website Sign In
10+
11+
To crawl content as a logged-in user, load the website you intend to archive in the embedded browser and sign in as you would in any other browser. Once you are logged in, confirm that the login worked by visiting a page on the site that the crawler should have access to. You may need to periodically log in again as websites may log users out after a certain period of time.
12+
13+
!!! tip "Tip: Crawl regularly to stay logged in"
14+
15+
Regularly running crawl workflows that use a browser profile can help to reduce the frequency with which logouts occur on some websites. Data such as cookies and sessions may be refreshed during crawling, and Browsertrix will automatically update the browser profile with this data when each crawl successfully finishes.
16+
17+
### Hide Popups
18+
19+
Load the website you intend to archive in the embedded browser and accept or otherwise dismiss the prompt. If the developers of the website have built the site in such a way that the result of your interaction is saved, the popup should remain hidden at crawl time. This can be confirmed by exiting the embedded browser session and then loading the site again.
20+
21+
### Customize the Crawling Browser
22+
23+
The embedded browser used to configure profiles is the same browser behind Browsertrix’s high-fidelity crawls. This enables advanced use cases like using a browser profile to customize the browser at crawl time. To view all available browser settings, load any site in the profile and then navigate to `brave://settings` in the embedded browser.
24+
25+
!!! info "Advanced Use Case: Proceed with caution"
26+
Customizing the crawler browser is for advanced use cases and it is not generally recommended to change these settings. We offer crawl-time browser customization like [ad blocking](../workflow-setup.md#block-ads-by-domain) and [language](../workflow-setup.md#language) in workflow settings. Changing browser settings directly in the profile may result in conflicting settings that are difficult to troubleshoot. If using this advanced feature, we recommend [adding clear metadata](./edit-browser-profile.md#edit-browser-profile-metadata) to the browser profile that describes the change.
27+
28+
??? example "Example: Blocking page resources with Brave's Shields"
29+
Whereas the crawler's scoping settings can be used to define which pages should be crawled, Brave's [Shields](https://brave.com/shields/) feature can block resources on pages from being loaded. By default, Shields will block [EasyList's cookie list](https://easylist.to/) but it can be set to block a number of other included lists under Brave `Settings > Shields > Filter Lists`.
30+
31+
_Custom Filters_ can also be useful for blocking resources that aren't included in one of the known block lists.
32+
33+
The [uBlock Origin filter syntax](https://github.com/gorhill/uBlock/wiki/Static-filter-syntax) can be used for more specificity over what in-page resources should be blocked.
34+
35+
All of the browser's ad blocking and privacy features can be used in combination with the [_Block Ads by Domain_](../workflow-setup.md#block-ads-by-domain) crawler setting.
36+
37+
## Saving the Profile
38+
39+
After you are done interacting with the embedded browser, press _Save Profile_ (or _Create Profile_ for new browser profiles).
40+
41+
## Saved Sites
42+
43+
All sites that are loaded in the embedded browser and then saved will appear in the _Saved Sites_ list. Select a site in the list to view or reconfigure the site in the embedded browser.
44+
45+
## Load New URL
46+
47+
You may want to load a URL that is not listed in the _Saved Sites_ to preview how a page may appear to the crawler, or to add a new site. Due to the nature of the embedded browser, it can be difficult to navigate between different websites if there are no hyperlinks between them. The easiest way to load a new URL is to press _Load New URL_ from the browser profile page and enter the URL.
48+
49+
Although browser profiles have no limit on the number of saved sites, we recommend one site per browser profile to make troubleshooting crawls easier. An exception is when using a [URL List](../workflow-setup.md#list-of-pages) workflow to crawl multiple websites that require a profile, as we only allow one browser profile per workflow.
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
1+
# Create a New Browser Profile
2+
3+
To create a new browser profile, press _New Browser Profile_ on the **Browser Profiles** page.
4+
5+
## New Browser Profile Settings
6+
7+
: ### Primary Site URL
8+
9+
: The URL of the first page to visit in the embedded browser. For example, the login page for a social media website.
10+
11+
: ### Profile Name
12+
13+
: A custom name for the browser profile. The domain name of the _Primary Site URL_ will be used if this field is left blank.
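The fallback described for _Profile Name_ can be illustrated with a short TypeScript sketch. This is only an illustration under stated assumptions, not Browsertrix's actual implementation: the function name `defaultProfileName` is hypothetical, and "domain name" is assumed here to mean the URL's hostname.

```typescript
// Hypothetical helper (not part of Browsertrix): picks the display name
// for a new browser profile.
function defaultProfileName(name: string, primarySiteUrl: string): string {
  const trimmed = name.trim();
  // A non-blank custom name is used as-is.
  if (trimmed !== "") {
    return trimmed;
  }
  // Assumption: the "domain name" is the hostname of the Primary Site URL.
  return new URL(primarySiteUrl).hostname;
}

// A blank name falls back to the hostname of the URL.
console.log(defaultProfileName("", "https://example.com/login"));
```

Under these assumptions, a custom name such as `"News login"` would be kept unchanged, and only a blank field triggers the fallback.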
14+
15+
Depending on your organization settings, additional settings may be available:
16+
17+
: ### Proxy Server
18+
19+
: The proxy server to be used by the embedded browser as well as any crawl that uses this profile.
20+
21+
!!! tip "Implication for crawl workflows using proxies"
22+
23+
When a browser profile is added to a crawl workflow, the browser profile’s proxy setting will take precedence over the crawl workflow’s [_Crawler Proxy Server_](../workflow-setup.md#crawler-proxy-server) setting. This prevents potential crawl failures that result from conflicting proxies.
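The precedence rule above can be sketched in TypeScript. This is a minimal illustration rather than Browsertrix's code: the `BrowserProfile` and `Workflow` shapes and the `effectiveProxy` function are hypothetical names. The guide does not say what happens when the attached profile has no proxy, so the sketch assumes the profile's setting still applies in that case.

```typescript
// Hypothetical data shapes, for illustration only.
interface BrowserProfile {
  proxyId: string | null; // null means no proxy is set on the profile
}

interface Workflow {
  proxyId: string | null; // the workflow's Crawler Proxy Server setting
  profile?: BrowserProfile; // present when a browser profile is attached
}

// Returns the proxy a crawl would use under the rule described above.
function effectiveProxy(workflow: Workflow): string | null {
  // When a profile is attached, its proxy setting takes precedence.
  // Assumption: this holds even when the profile has no proxy set.
  if (workflow.profile) {
    return workflow.profile.proxyId;
  }
  // Without a profile, the workflow's own setting applies.
  return workflow.proxyId;
}

const withProfile: Workflow = { proxyId: "proxy-b", profile: { proxyId: "proxy-a" } };
console.log(effectiveProxy(withProfile)); // the profile's proxy, "proxy-a"
```

Resolving to a single proxy in one place is what avoids the conflicting-proxy failures mentioned in the tip.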
24+
25+
: ### Crawler Release Channel
26+
27+
: For advanced use cases, you can specify a [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) release that contains another version of the embedded browser, such as a beta version that may contain experimental features.
28+
29+
Press _Start Browser_ to load the temporary embedded browser used to configure sites. It may take a few moments for the embedded browser to load. The browser profile will not be saved until _Create Profile_ is pressed.
Lines changed: 59 additions & 0 deletions
@@ -0,0 +1,59 @@
1+
# Edit Browser Profile
2+
3+
Sometimes websites will log users out by expiring cookies or login sessions after a period of time. Although running crawls on a regular basis can help keep websites logged in for longer, most websites will log users out after an extended period of time, be it a week or 30 days. In this case, the browser profile may not behave as expected at crawl time and will need to be reconfigured.
4+
5+
To check or update the profile, go to the browser profile page and select _Load Profile_ from the action menu.
6+
7+
!!! tip "Tip: Fail crawls early to identify logged-out profiles"
8+
9+
Enabling [_Fail Crawl If Not Logged In_](../workflow-setup.md#fail-crawl-if-not-logged-in) on a workflow can help identify which profiles need attention and prevent adding unwanted logged-out content to a collection.
10+
11+
## Load / Edit Profile Settings
12+
13+
: ### Primary Site
14+
15+
: The primary site that is configured in the browser profile. A new primary site can be configured by choosing _New Site_ in the dropdown.
16+
17+
: If the browser profile is in use by crawl workflows with a [crawl start URL](../workflow-setup.md#crawl-start-url-urls-to-crawl) that is not a saved site, a section titled _Suggestions from Related Workflows_ with suggested options will be displayed in the dropdown.
18+
19+
: ### URL to Load
20+
21+
: The URL of the first page in the primary site to visit in the embedded browser. For example, the login page for a social media website.
22+
23+
: ### Reset previous configuration on save
24+
25+
: If checked, all previously saved sites and their associated data will be removed from the browser profile. If your organization supports proxies, this will also enable choosing a different proxy server.
26+
27+
Depending on your organization settings, additional settings may be available:
28+
29+
: ### Proxy Server
30+
31+
: The proxy server to be used by the embedded browser as well as any crawl that uses this profile.
32+
33+
??? tip "Implication for crawl workflows using proxies"
34+
35+
When a browser profile is added to a crawl workflow, the browser profile’s proxy setting will take precedence over the crawl workflow’s [_Crawler Proxy Server_](../workflow-setup.md#crawler-proxy-server) setting. This prevents potential crawl failures that result from conflicting proxies.
36+
37+
: ### Crawler Release Channel
38+
39+
: For advanced use cases, you can specify a [Browsertrix Crawler](https://github.com/webrecorder/browsertrix-crawler) release that contains another version of the embedded browser, such as a beta version that may contain experimental features.
40+
41+
Press _Start Browser_ to load the temporary embedded browser. It may take a few moments for the embedded browser to load.
42+
43+
When finished, press the _Save Profile_ button to return to the profile's details page. Profiles are automatically backed up on save if replica storage locations are configured.
44+
45+
## Edit Browser Profile Metadata
46+
47+
To edit the name, description, and tags, select _Edit Metadata_ from the action menu on the browser profile page.
48+
49+
: ### Name
50+
51+
: A custom name for the browser profile.
52+
53+
: ### Description
54+
55+
: A short description of the browser profile.
56+
57+
: ### Tags
58+
59+
: Tag the browser profile with additional metadata like category or keywords. Tags are displayed on the browser profiles list page.
