OpenTermsArchive
diff --git a/‎content/explanations/community/federation-benefits.md‎
Lines changed: 20 additions & 1 deletion b/‎content/explanations/community/federation-benefits.md‎
diff --git a/‎content/explanations/design-principles.md‎
Lines changed: 52 additions & 1 deletion b/‎content/explanations/design-principles.md‎
diff --git a/‎content/explanations/main-concepts.md‎
Lines changed: 77 additions & 1 deletion b/‎content/explanations/main-concepts.md‎
diff --git a/‎content/explanations/terms-tracking/declarations-maintenance.md‎
Lines changed: 84 additions & 0 deletions b/‎content/explanations/terms-tracking/declarations-maintenance.md‎
diff --git a/‎content/explanations/terms-tracking/range-selectors.md‎
Lines changed: 62 additions & 1 deletion b/‎content/explanations/terms-tracking/range-selectors.md‎
diff --git a/‎content/explanations/terms-tracking/service-name-noise copy.md‎
Lines changed: 0 additions & 6 deletions b/‎content/explanations/terms-tracking/service-name-noise copy.md‎
diff --git a/‎content/explanations/terms-tracking/service-name-noise.md‎
Lines changed: 35 additions & 1 deletion b/‎content/explanations/terms-tracking/service-name-noise.md‎
@@ -3,4 +3,23 @@ title: Federation benefits
 weight: 1
 ---
 
-# Federation Benefits
+# Open Terms Archive federation
+
+Open Terms Archive is a decentralised system. It aims to enable any entity to set up its own collections and track terms on its own.
+
+In order to maximise **discoverability**, **collaboration** and **political power**, public collections are federated within a single ecosystem. This makes their data mutually discoverable and enables mutualising effort.
+
+## Benefits of joining the federation
+
+A collection that joins the federation enjoys the following benefits:
+
+1. Visibility on the Open Terms Archive website lists of collections and datasets.
+2. Access to the Open Terms Archive GitHub organisation, administered by the Open Terms Archive core team.
+3. Collection logo provided by the Open Terms Archive core team.
+4. Referencing in the official [collections list](https://opentermsarchive.org/collections.json), enabling off-the-shelf discovery in the [Federation API]({{< relref "/api/node" >}}).
+5. Referencing in the official [datasets list](https://opentermsarchive.org/datasets), providing visibility to analysts.
+6. Dedicated channel on the Open Terms Archive instant messaging system.
+7. API uptime tracking.
+8. Public announcement through all Open Terms Archive communication channels upon joining.
+
+By joining the Open Terms Archive federation, your collection becomes part of a **dynamic network** that leverages **shared resources** and **collective visibility** to drive **greater impact**.
@@ -1,6 +1,57 @@
 ---
-title: Design principles
+title: "Design principles"
 weight: 2
+aliases: /design-principles/
 ---
 
 # Design principles
+
+These overarching principles guide technical and governance decisions. They are fundamental and can only be changed through community consensus, based on a thorough impact assessment.
+
+Each principle has a name, a rationale, and potential implementation examples or guidelines.
+
+## 1. Never trust the services
+
+A major goal of Open Terms Archive is to enable assessing the loyalty of services towards their end users. Since loyalty is not assumed, trust cannot be warranted.
+
+### Cases
+
+Several services have been observed:
+
+- blocking an IP or a user agent randomly;
+- pretending to encounter technical errors (`500`, `502`…) instead of being explicit about their intention (`robots.txt`, `403`…);
+- failing to reflect actual updates in the “last update” date of their contractual documents;
+- changing the content of the same page based on user agent properties or source IP geolocation. When one accesses a policy that is supposedly already regionalized according to its URL, but gets different content based on geolocation, without any information on or ability to access other regional policies, we [consider](https://github.com/OpenTermsArchive/docs/pull/43#discussion_r1252232131) it misleading and disloyal.
+
+### Examples of consequential choices
+
+- Do not use “last update” date in documents or headers for metadata.
+
+## 2. Do not require trust in maintainers
+
+Users should not need to trust Open Terms Archive maintainers more than they trust the services that Open Terms Archive enables them to assess.
+
+### Cases
+
+- Collections can be unmaintained.
+- Maintainers can filter out content that could be relevant from the perspective of other maintainers.
+- A server can encounter technical problems and miss updates.
+
+### Examples of consequential choices
+
+- Always keep an untouched snapshot of the source documents.
+- Use cryptographic signatures to ensure the database can be authenticated.
+- Enable terms collection to be replicated by anyone.
+- Support duplication across collections as this increases the resilience of the network. It will be up to reusers to decide which source they prefer in case of divergence.
+
+## 3. Obtain documents like a user would
+
+In order to guarantee legal relevance, source documents should only be the ones that end users of the service themselves receive. Following principle 1, a version of the source documents obtained through a technical workaround because it is easier for machines to handle cannot be trusted to have the same content as the one intended for end users.
+
+### Cases
+
+- Accessing the same URL from a differently geolocated IP address will change the contents in some services.
+
+### Examples of consequential choices
+
+- Scrape HTML even if one could obtain the contractual content from an API.
@@ -3,4 +3,80 @@ title: Main concepts
 weight: 1
 ---
 
-# Main concepts
+## Main concepts
+
+Words in bold are [business domain names](https://en.wikipedia.org/wiki/Domain-driven_design).
+
+**Services** have **terms** written in **documents**, contractual (Terms of Service, Privacy Policy…) or not (Community Guidelines, Deceased User Policy…), that can change over time. Open Terms Archive enables user rights advocates, regulatory bodies and interested citizens to follow the **changes** to these **terms**, to be notified whenever a new **version** is published, to explore their entire **history** and to collaborate in analysing them. This free and open-source engine is developed to support these goals.
+
+### Collection
+
+Open Terms Archive is a decentralised system. It aims at enabling any entity to **track** **terms** on its own. To that end, the Open Terms Archive **engine** can be run on any server, thus making it a dedicated **instance**. An **instance** **tracks** **terms** within a single **collection**.
+
+A **collection** is characterised by a **scope** across **dimensions** that describe the **terms** it **tracks**, such as **language**, **jurisdiction** and **industry**.
+
+#### Example scope
+
+> The terms tracked in this collection are:
+>
+> - Of dating services used in Europe.
+> - In the European Union and Switzerland jurisdictions.
+> - In English, unless no English version exists, in which case the primary official language of the jurisdiction of incorporation of the service operator will be used.
+
+### Federation
+
+In order to maximise discoverability, collaboration and political power, public **collections** are **federated** within a single ecosystem. This makes their data mutually discoverable and enables mutualising effort.
+
+### Terms types
+
+To distinguish between the different **terms** of a **service**, each has a **type**, such as “Terms of Service”, “Privacy Policy”, “Developer Agreement”…
+
+This **type** matches the topic, but not necessarily the title the **service** gives to it. Unifying the **types** enables comparing **terms** across **services**.
+
+> More information on terms types can be found in the [dedicated repository](https://github.com/OpenTermsArchive/terms-types). They are published on NPM under [`@opentermsarchive/terms-types`](https://www.npmjs.com/package/@opentermsarchive/terms-types), enabling standardisation and interoperability beyond the Open Terms Archive engine.
+
+### Declarations
+
+The **terms** that constitute a **collection** are defined in simple JSON files called **declarations**.
+
+A **declaration** also contains some metadata on the **service** on which the **terms** apply.
+
+> Here is an example declaration tracking the Privacy Policy of Open Terms Archive:
+>
+> ```json
+> {
+>   "name": "Open Terms Archive",
+>   "terms": {
+>     "Privacy Policy": {
+>       "fetch": "https://opentermsarchive.org/en/privacy-policy",
+>       "select": ".textcontent"
+>     }
+>   }
+> }
+> ```
+
+- - -
+
+## Add terms to a collection
+
+Open Terms Archive **acquires** **terms** to deliver an explorable **history** of **changes**. This can be done in two ways:
+
+1. For the present and future, by **tracking**.
+2. For the past, by **importing** from an existing **fonds** such as [ToSBack](https://tosback.org), the [Internet Archive](https://web.archive.org/), [Common Crawl](https://commoncrawl.org) or any other in-house format.
+
+### Tracking terms
+
+In order to **track** the **changes** of **terms**, the **engine** **records** a **snapshot** of **documents** that contain them by **fetching** their web **location** several times a day. The **engine** then **extracts** a **version** from this **snapshot** by:
+
+1. **Selecting** the subset of the **document** (or **documents**) that contains the **terms** (instead of, e.g., navigation menus, footers, cookie banners…).
+2. **Removing insignificant content**, that is, residual content in this subset that is not part of the **terms** (e.g. ads, illustrative pictures, internal navigation links…).
+3. **Filtering noise** that can emerge in the **terms** by preventing parts that change frequently from triggering false positives for **changes** (e.g. tracker identifiers in links, relative dates…). The **engine** can execute custom **filters** written in JavaScript to that end.
+
+After these steps, if **changes** are spotted in the resulting **terms**, a new **version** is **recorded**.
+
+Preserving **snapshots** makes it possible to recover, after the fact, information potentially lost in the **extraction** step: if **declarations** were wrong, they can be **maintained** and corrected **versions** can be **extracted** from the original **snapshots**.
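The noise-filtering step can be illustrated with a minimal sketch of a custom filter, assuming that a filter receives the DOM `document` and modifies it in place. The function name `removeTrackingParameters` and the list of query parameters are hypothetical and chosen for illustration only; they are not part of the engine.

```javascript
// Hypothetical list of query parameters considered as tracking noise.
const TRACKING_PARAMETERS = ['utm_source', 'utm_medium', 'utm_campaign'];

// Hypothetical filter: strips tracking parameters from absolute links so that
// a changing tracker identifier does not show up as a change in the terms.
function removeTrackingParameters(document) {
  document.querySelectorAll('a[href]').forEach((link) => {
    let url;

    try {
      url = new URL(link.getAttribute('href'));
    } catch {
      return; // Relative or malformed links are left untouched in this sketch.
    }

    TRACKING_PARAMETERS.forEach((parameter) => url.searchParams.delete(parameter));
    link.setAttribute('href', url.toString());
  });
}
```

With such a filter, a link to `https://example.com/help?utm_source=footer&page=2` would be rewritten to `https://example.com/help?page=2` before versions are compared.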
+
+### Importing terms
+
+Existing **fonds** can be prepared for easier analysis by unifying their format to the **Open Terms Archive dataset format**. This unique format enables building interoperable tools, fostering collaboration across reusers.
+Such a dataset can be generated from **versions** alone. If **snapshots** and **declarations** can be retrieved from the **fonds** too, then a full-fledged **collection** can be created.
@@ -4,3 +4,87 @@ weight: 1
 ---
 
 # Declarations maintenance
+
+All parts of a **terms** **declaration** (web location, selection, noise removal, distribution across multiple documents…) can change over time. The process of updating these elements to enable continued **tracking** is called **maintenance**. Without it, **terms** can become:
+
+- **unreachable**: no **snapshot** can be **recorded** at all, because the **location** changed or the **service** denies access;
+- **unextractable**: no **version** can be **extracted** from the **snapshot**, because the selection of content or some **filter** fails;
+- **noisy**: both **snapshots** and **versions** are **recorded** but the **changes** contain **noise** that should have been **filtered out**.
+
+Open Terms Archive needs to keep track of these changes in order to regenerate the history of versions from the history of snapshots.
+
+## Service history reference
+
+To keep track of changes to service declarations and filters, Open Terms Archive offers a versioning system. It is optional and should be added only when needed. It works by creating history files for terms declarations and filters, where each entry should be a previously valid declaration or filter function and should have an expiration date.
+
+For both terms and filters history, the expiration date is declared in the `validUntil` property. It should be the authored date and time of the last snapshot commit for which the declaration is still valid.
+
+Terms declarations history files and filters history files can both evolve on their own. Having one does not require creating the other.
+
+The current (latest) valid declaration has no date and should not appear in the history object: it stays in its own file, just as if there were no history at all.
+
+### Terms declaration history
+
+Declarations history is stored in a history JSON file named `declarations/$service_id.history.json`.
+
+The terms history contains an object with terms types as properties. Each terms type property is an array of history entries. Each entry has the same format as a normal terms declaration, except there is the **mandatory** extra property `validUntil`.
+
+```json
+{
+  …
+  "<terms type>": [
+    {
+      "fetch": "The URL where the document can be found",
+      "executeClientScripts": "A boolean to execute client-side JavaScript loaded by the document before accessing the content, in case the DOM modifications are needed to access the content; defaults to false (fetch HTML only)",
+      "filter": "An array of service specific filter function names",
+      "remove": "A CSS selector, a range selector or an array of selectors that target the insignificant parts of the document that have to be removed. Useful to remove parts that are inside the selected parts",
+      "select": "A CSS selector, a range selector or an array of selectors that target the meaningful parts of the document, excluding elements such as headers, footers and navigation",
+      "validUntil": "The inclusive expiration date in ISO format"
+    }
+  ]
+  …
+}
+```
+
+For example, to add a history entry for the `Terms of Service` of the service `ASKfm`, create the file `declarations/ASKfm.history.json` with the following contents:
+
+```json
+{
+  "Terms of Service": [
+    {
+      "fetch": "https://ask.fm/docs/terms_of_use/?lang=en",
+      "select": "body",
+      "filter": ["add"],
+      "validUntil": "2020-10-29T21:30:00.000Z"
+    }
+  ]
+}
+```
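To make the role of the inclusive `validUntil` date concrete, here is a minimal sketch of how the declaration applicable to a given snapshot could be looked up. The helper name `findDeclarationAt` and the `select: 'main'` current declaration are hypothetical; this illustrates the rule described above and is not the engine's actual implementation.

```javascript
// Hypothetical helper: returns the history entry that was valid when the
// snapshot was recorded, or the current declaration if all entries had expired.
function findDeclarationAt(historyEntries, currentDeclaration, snapshotDate) {
  const snapshotTime = new Date(snapshotDate).getTime();

  const stillValid = historyEntries
    .map((entry) => ({ entry, expiration: new Date(entry.validUntil).getTime() }))
    // `validUntil` is inclusive: an entry still applies at its exact expiration date.
    .filter(({ expiration }) => snapshotTime <= expiration)
    // Among the entries still valid, the one expiring first is the applicable one.
    .sort((a, b) => a.expiration - b.expiration);

  return stillValid.length > 0 ? stillValid[0].entry : currentDeclaration;
}

const termsOfServiceHistory = [
  { fetch: 'https://ask.fm/docs/terms_of_use/?lang=en', select: 'body', validUntil: '2020-10-29T21:30:00.000Z' },
];
const currentTermsOfService = { fetch: 'https://ask.fm/docs/terms_of_use/?lang=en', select: 'main' }; // hypothetical

// A snapshot recorded before the expiration date uses the history entry.
console.log(findDeclarationAt(termsOfServiceHistory, currentTermsOfService, '2020-06-01T00:00:00.000Z').select);
// A snapshot recorded after it uses the current declaration.
console.log(findDeclarationAt(termsOfServiceHistory, currentTermsOfService, '2021-01-01T00:00:00.000Z').select);
```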
+
+### Filters history
+
+Filters history is declared in a filters history declaration JavaScript file with the following name: `declarations/$service_id.filters.history.js`.
+
+For each filter, a variable named after the filter must be exported. This variable should contain an array of filter history entries. Each entry is an object with the expiration date, as the `validUntil` property, and the function valid until this date, under the `filter` property. Both properties are **mandatory**.
+
+```js
+export const <filterName> = [
+  {
+    validUntil: "The inclusive expiration date in ISO format",
+    filter: function() { /* body valid until the expiration of the `validUntil` date */ }
+  }
+];
+```
+
+For example, to add a history entry for the `removeSharesButton` filter of the service `ASKfm`, create the file `declarations/ASKfm.filters.history.js` with the following content:
+
+```js
+export const removeSharesButton = [
+  {
+    validUntil: '2020-08-22T11:30:21.000Z',
+    filter: async (document) => {
+      document.querySelectorAll('.shares').forEach((element) => element.remove());
+    },
+  },
+];
+```
@@ -3,4 +3,65 @@ title: Range selectors
 weight: 2
 ---
 
-# Range selectors
+## Range selectors
+
+When no unique wrapper element exists for the whole terms content, there is no easy way to select the content with only CSS selectors. Content between two elements in a document can be selected using a range selector, regardless of their DOM position. The concept is inspired by the DOM [Range API](https://developer.mozilla.org/en-US/docs/Web/API/Range), where content is defined by start and end points that may be included or excluded. The format is defined as a JSON object:
+
+```json
+{
+  "start[Before|After]": "CSS selector that marks where to begin capturing content",
+  "end[Before|After]": "CSS selector that marks where to stop capturing content"
+}
+```
+
+### Example
+
+Let's take an example to see when range selectors can be useful. Given the following HTML:
+
+```html
+<html>
+
+<body>
+  <main>
+    <!-- Breadcrumb Navigation -->
+    <ul>
+      <li><a href="/">Home</a></li>
+      <li>Terms and Conditions</li>
+    </ul>
+
+    <!-- Main Content -->
+    <h1 id="terms-title">Example Terms</h1>
+    <p>Effective as of: January 1, 2024</p>
+
+    <h2>Authorized uses</h2>
+    <p>You can use this service in the following cases:</p>
+
+    <ul>
+      <li>At home</li>
+      <li>In your office</li>
+      <li>In a coffee shop</li>
+    </ul>
+  </main>
+  <div>
+    <ul id="footer-menu">
+      <li><a href="/about">About</a></li>
+      <li><a href="/contact">Contact</a></li>
+    </ul>
+  </div>
+</body>
+
+</html>
+```
+
+In this case, there is no unique wrapper element for the terms content, which consists of the main title and all the elements that follow it in the `main` element. Selecting the whole `main` would also select elements that are not part of the terms content, like the breadcrumb navigation. A range selector can select the terms content by specifying the main title `#terms-title` as the start point and the footer menu `#footer-menu` as the end point. The selection starts *before* the main title, so it includes it, and ends *before* the footer menu, so it excludes it.
+
+So the resulting range selector is:
+
+```json
+{
+  "startBefore": "#terms-title",
+  "endBefore": "#footer-menu"
+}
+```
+
+This range selector will select the terms content between the main title and the footer element.
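The inclusion and exclusion semantics of `start[Before|After]` and `end[Before|After]` can be sketched on a deliberately simplified model, where the document is a flat list of elements in document order and selectors are matched against `#id` only. The function `selectRange` and this flat model are assumptions made for illustration; the engine itself works on the real DOM.

```javascript
// Simplified model: `elements` is a flat list in document order and only
// `#id` selectors are supported. This is a sketch, not the engine's code.
function selectRange(elements, { startBefore, startAfter, endBefore, endAfter }) {
  const indexOf = (selector) => elements.findIndex((element) => `#${element.id}` === selector);

  const startIndex = indexOf(startBefore ?? startAfter);
  const endIndex = indexOf(endBefore ?? endAfter);

  if (startIndex === -1 || endIndex === -1) {
    throw new Error('Start or end element not found');
  }

  // `startBefore` includes the start element, `startAfter` excludes it.
  const from = startBefore !== undefined ? startIndex : startIndex + 1;
  // `endAfter` includes the end element, `endBefore` excludes it.
  const to = endAfter !== undefined ? endIndex + 1 : endIndex;

  return elements.slice(from, to);
}

const pageElements = [
  { id: 'breadcrumb', text: 'Home / Terms and Conditions' },
  { id: 'terms-title', text: 'Example Terms' },
  { id: 'effective-date', text: 'Effective as of: January 1, 2024' },
  { id: 'authorized-uses', text: 'Authorized uses' },
  { id: 'footer-menu', text: 'About / Contact' },
];

const selection = selectRange(pageElements, { startBefore: '#terms-title', endBefore: '#footer-menu' });
console.log(selection.map((element) => element.id));
```

In this model, the title is kept because the range starts before it, while the breadcrumb and the footer menu are left out.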
@@ -3,4 +3,38 @@ title: Service naming
 weight: 5
 ---
 
-# Service name usual noise
+## Service name
+
+### Casing
+
+- In order to find the service name casing, rely first on the page title (easily found in search results). Do not rely on the logo as it can be stylized differently. Example with Facebook:
+![facebook search](https://user-images.githubusercontent.com/222463/91416484-baaa3a00-e84f-11ea-94cf-8805d17aa711.png)
+- If it is still ambiguous, rely on Wikipedia as a source. However, make sure to differentiate the _service_ from the _provider_ company's name. Example with “DeviantArt”, a service (which used to be stylized deviantArt until 2014) by the limited liability company “deviantArt”:
+![deviantArt search](https://user-images.githubusercontent.com/222463/91416936-5b98f500-e850-11ea-80fe-a50be27356e3.png)
+
+### Terms used by several services
+
+- If you want to add terms which happen to be shared with another service from the same parent company, be specific in naming the exact service you want to track. For instance, you may find that a company like GitHub uses the same terms for its code hosting and its AI assistant. While this does not mean that the terms for GitHub (code hosting) are the only terms of GitHub Copilot (assistant), it does mean that these two services have terms that are represented in the same document. In tracking terms for one of these services, say GitHub Copilot, be specific in naming it as the service you want to track. This way, if GitHub were to introduce dedicated terms for each of these services in the future, their locations could be updated without having to create new terms, since the service would already exist.
+
+- - -
+
+## Service ID
+
+### Normalisation
+
+1. For non-roman alphabets (Cyrillic, ideograms…), use the service-provided transliteration.
+2. For diacritics: normalise the string to its Unicode `NFD` normalisation form, then remove all combining diacritical marks. [Details](https://stackoverflow.com/a/37511463/594053).
+3. As a last resort, use the domain name.
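The diacritics rule above can be sketched with the standard `String.prototype.normalize` method. The function name `removeDiacritics` and the sample service names are hypothetical, and the character range used is the Unicode “Combining Diacritical Marks” block (U+0300 to U+036F).

```javascript
// Hypothetical helper: decomposes accented characters into a base letter
// followed by combining marks (NFD), then removes the combining marks.
function removeDiacritics(serviceName) {
  return serviceName.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(removeDiacritics('Pokémon GO')); // Pokemon GO
console.log(removeDiacritics('Crédit Agricole')); // Credit Agricole
```

Characters that do not decompose into a base letter and a combining mark are left unchanged, so such cases still need the transliteration or domain name rules.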
+
+### Provider prefixing
+
+- If you encounter terms you want to add to a service, yet find that it would override already-declared terms for this service such as Terms of Service or Privacy Policy, and that the only solution you see would be to create a new terms type that would contain the name of the feature, then it is likely you should declare a new service, potentially duplicating existing terms.
+
+> Example: the Facebook Community Payments terms are Terms of Service. The only way to declare them in the Facebook service would be to add a “Community Payments Terms” terms type as they would otherwise conflict with Facebook's Terms of Service. It is better to declare a new service called “Facebook Payments” with its own Terms of Service. It turns out that this service also has a developer agreement, independent from the main Facebook service.
+
+![Facebook Community Payments](https://user-images.githubusercontent.com/222463/91419033-3a85d380-e853-11ea-8468-42a536b7e87b.png)
+
+- As a last resort, rely on the trademark.
+
+Example: Apple's App Store uses only generic terms (“app” and “store”). However, it is common to refer to “the App Store” as Apple's. To help us decide whether it should be prefixed or not, we can check that [Apple has trademarked “App Store”](https://www.apple.com/legal/intellectual-property/trademark/appletmlist.html). The service can thus be named “App Store”, without prefixing.
+