389 changes: 369 additions & 20 deletions docs/components/index.md

Large diffs are not rendered by default.

15 changes: 15 additions & 0 deletions docs/primers/accuracy_vs_precision.md
Accuracy vs Precision
=====================

When producing data, key considerations include accuracy and precision.

Accuracy is how close to the truth a given value is.

Precision is the level of detail included in the value.

For example, a boxer might step on the scales and record a value of 87kg. If the boxer’s true weight is 87.43kg, then the scales are accurate to +/- 1kg, and have a degree of precision of 1kg. If the scales were to record a weight of 98.7285kg, they would have a high degree of precision, but a low degree of accuracy.

For data standards, this distinction can be applied to other kinds of statement as well. For example, if an event is described as “Usually happening every Monday at 9am” then its degree of accuracy is relatively high (because the word “usually” acknowledges that it might sometimes not happen), but its precision is relatively low (because it doesn’t tell us under what conditions it might not happen). Conversely, an event that is described as happening on Monday 3rd May 2021 at 9am is very precise, but may not be accurate if the event doesn’t actually happen on bank holidays.

Conveying the level of precision in a data standard (as part of the design, or the metadata) can be important for ensuring that its accuracy is understood by data users. Typically, the more precise data needs to be, the higher the costs involved in creating it accurately.
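
A rough sketch of how a record might carry its own precision as metadata, so that users don’t read more detail into a value than it supports (the field names here are hypothetical, not drawn from any particular standard):

```python
# A hypothetical measurement record that states its own precision.
boxer_weight = {
    "value": 87,       # the recorded measurement
    "unit": "kg",
    "precision": 1,    # the scales only resolve whole kilograms
}

def describe(record):
    """Render the value without implying more precision than was recorded."""
    return (f"{record['value']} {record['unit']} "
            f"(recorded to the nearest {record['precision']} {record['unit']})")

print(describe(boxer_weight))  # 87 kg (recorded to the nearest 1 kg)
```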

19 changes: 19 additions & 0 deletions docs/primers/aggregators_data_stores.md
Aggregators & Data Stores
=========================


Aggregators are online tools which bring together multiple sources of data, and present them to users as one complete feed. This means that someone wanting to use the data only has to connect to one source, rather than multiple, reducing the complexity of their system.

Aggregators will often keep a copy of the data that they’ve downloaded, so that if the original source encounters an outage, the data from that source is still available for users of the aggregator.

Aggregators might collect all of the data available in a domain, or only some (such as that relating to a particular audience or region).

Aggregators will sometimes carry out a degree of processing of the data. This might include (a rough sketch in code follows the list):
* De-duplication: identifying and removing data items that have been provided by multiple data sources. This can happen if an aggregator consumes data from other aggregators - such as one that covers a particular sport, and another that covers a particular region.
* Normalization: converting data that is in multiple formats to the same format. This can happen if there isn’t a standard, or if a standard is quite loose. By normalizing in the aggregator, individual data users can receive more consistent data.
* Filtering: removing data that isn’t relevant. For example, an aggregator might remove data that’s outdated or too far in the future to be relevant, or might only include data that meets certain criteria - such as a certain baseline of data quality, or use of particular fields.
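
A rough sketch of these three steps, assuming incoming items with hypothetical `id` and `start_date` fields supplied as strings (the field names and date formats are illustrative, not taken from any particular standard):

```python
from datetime import date, datetime

def deduplicate(items):
    """Keep one copy of each item, keyed on its identifier."""
    seen = {}
    for item in items:
        seen.setdefault(item["id"], item)
    return list(seen.values())

def normalise(item):
    """Convert a start date in either ISO or day-first format to a date object."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            item["start_date"] = datetime.strptime(item["start_date"], fmt).date()
            break
        except ValueError:
            continue
    return item

def is_relevant(item, today):
    """Filter out items that started in the past."""
    return item["start_date"] >= today

def aggregate(sources, today=None):
    """Merge several feeds into one: de-duplicate, normalise, then filter."""
    today = today or date.today()
    merged = deduplicate([item for source in sources for item in source])
    merged = [normalise(item) for item in merged]
    return [item for item in merged if is_relevant(item, today)]
```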

Aggregators may be part of data stores, or may provide the first stage of the pipeline that feeds them.

Data stores download all of the data that’s available, and then store it in a way that’s useful for querying; this often involves considerable processing and the creation of policies to handle retention and deletion of data. Data stores can be used to understand the data at a point in time, to generate statistics about the data, and to observe how the data has changed over time (e.g. number of activities, which fields are used, how the data quality has changed).

38 changes: 38 additions & 0 deletions docs/primers/customisations.md
Customisations
==============

An open data standard comprises an agreed-upon "common ground" around a particular subject or domain. This is a nuanced balance to strike - too little common ground, and the standard doesn't actually shape the data sufficiently to be used; too much and the standard is overly burdensome to use, or inappropriate for some potential users.

In practical terms, this will affect which fields are required and which are optional, what constraints are placed on the contents of fields (such as length, conformance to a particular format, or reference to an external data source), and how fields are used together. If too few fields are required, then publishers of data may not actually provide the information that users need.

A standard with too little common ground defined may also model concepts that are too abstract for their intended use case. This results in implementers having to create their own ways to use the standard in their contexts, without necessarily doing so in the same way. For example, a data standard that models lectures might not enforce using the provided way to model a course of lectures (because lectures can be standalone) - so users of that data then find that each publisher describes a course of lectures in a different way.

The decisions that are made around modelling are a product of the immediate and future needs of the users of the standard - an elegant technical solution may be unworkable in practice, while a solution that's easy to publish is likely to be hard to use.

In the communities around standards, it's common to find members who are more closely aligned with each other than with the rest of the community. If they work in the same sub-sector of an industry or just conceive of the domain in the same way, then it's likely that they will be able to share more information with each other, and share that information in a more aligned way. Giving these sub-communities a structured way to do this, so that the result is useful data for all users of the standard, is something that standards approach in different ways.

## Extensions

The most formal way of making a standard customisable is to allow the creation of extensions. These are a set of technical constraints (usually schema) which can (as sketched in code after the list):
* Add fields
* Add additional constraints to existing fields
* Make optional fields compulsory
* Combine new and existing fields and constraints into new models, such as a more specific instance of an abstract concept.
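
A rough sketch of what such an extension might look like, using JSON Schema-style fragments held as Python dictionaries; the field names and the merge behaviour are illustrative assumptions rather than any particular standard's extension mechanism:

```python
# The base standard: a deliberately loose model of a lecture.
base_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "startDate": {"type": "string", "format": "date-time"},
        "course": {"type": "string"},  # optional in the base standard
    },
    "required": ["title"],
}

# A hypothetical community extension: it adds a field, tightens an existing
# field, and makes an optional field compulsory - but removes nothing.
extension = {
    "properties": {
        "course": {"type": "string", "minLength": 1},
        "courseWebsite": {"type": "string", "format": "uri"},
    },
    "required": ["course"],
}

def apply_extension(base, ext):
    """Merge an extension into a base schema without removing fields or constraints."""
    merged = dict(base)
    merged["properties"] = {**base["properties"], **ext.get("properties", {})}
    merged["required"] = sorted(set(base["required"]) | set(ext.get("required", [])))
    return merged

extended_schema = apply_extension(base_schema, extension)
```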

How these extensions are governed varies, but it can include:
* "Official" extensions which are part of the standard, but only applicable in certain circumstances
* A way for a community to publish and maintain extensions, which might only be applicable to that community
* As a matter of good practice, individual publishers describing the modifications that they've made to the standard, or extra data that they've provided

Typically, extensions aren't allowed to remove fields or constraints, as this would undermine the "common ground" that can usually be assumed around a standard.

A standards initiative might create a list of known extensions and recommend their use, so that future publishers can align with existing ones when modelling the same concepts.

## Profiles

Less formal than extensions, profiles are a collection of artefacts (potentially including schema, documentation, case studies and guidance) that describe how a standard can be put to use in a particular way.

Profiles allow a group of users of a standard to describe the ways that they've resolved ambiguity or used flexibility in a standard, with the aim that others like them will follow the same approach.



6 changes: 6 additions & 0 deletions docs/primers/four_types_of_documentation.md
The Four Types of Documentation
===============================

![The Four Types of Documentation](four_types_of_documentation.png)

The “four types” model helps to describe the different needs that people bring to documentation at different times: learning-oriented tutorials, goal-oriented how-to guides, information-oriented reference, and understanding-oriented explanation. For most projects, all four types of documentation are required.
Binary file added docs/primers/four_types_of_documentation.png
14 changes: 14 additions & 0 deletions docs/primers/linked_data_semantic_markup.md
Linked Data & Semantic Markup
=============================

Linked Data has evolved from the domain of knowledge management, which studies how things that are known can be organised and discovered, and how the relationships between them can be described.

Semantic markup is the practice of including machine-readable information in web pages alongside the human-readable parts, so that the information can be used in linked data applications as well as by people.

Since the mid-2000s, this “semantic web” approach has been advocated for by many leaders in web technology, most notably Sir Tim Berners-Lee.

Semantic markup allows for high levels of automation and machine reasoning - computer systems can act in smart ways with the data that they consume. If a website presents a table of opening hours, a computer doesn’t “know” what it means - it’s just a table with some text in it. With semantic markup, a computer can “know” that the text is a series of times, that those times represent when the physical place referred to by the website is open, and therefore can decide to present a warning to someone using a mapping application that the place that they’re planning a route to might be closed.

Linked Data approaches are most commonly found in contexts that are close to knowledge management, such as academia, museums, libraries, search engines and certain AI / Machine Learning businesses. Although the technologies are well-developed, linked data approaches are relatively rare outside of these contexts, and so developers approaching Linked Data projects for the first time often have a steep learning curve.

Schema.org is a public project (hosted as a W3C community group and funded by the major search engines) that provides the most widely-used vocabulary for the semantic web.
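
For instance, the opening-hours example above might be expressed with schema.org vocabulary. A rough sketch, built here as a Python dictionary and serialised to JSON-LD (the business itself is made up; the property names come from schema.org's LocalBusiness and OpeningHoursSpecification types):

```python
import json

# Machine-readable opening hours that a consuming application can reason about.
place = {
    "@context": "https://schema.org",
    "@type": "LocalBusiness",
    "name": "Example Café",
    "openingHoursSpecification": [
        {
            "@type": "OpeningHoursSpecification",
            "dayOfWeek": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday"],
            "opens": "09:00",
            "closes": "17:00",
        }
    ],
}

print(json.dumps(place, indent=2, ensure_ascii=False))
```
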
21 changes: 21 additions & 0 deletions docs/primers/pace_layering.md
Pace Layering
=============

![Pace Layering](pace_layering.png)

Pace layering describes how different components of a system change at different rates, and how these layers interact. Layers closer to the centre move more slowly, and provide a stabilising effect. Layers on the outside change more quickly, responding to change in the environment almost immediately.

Layers further out are also:
* Easier to describe - you can demonstrate that “red clothes are popular in my city this season” much more readily than any statement that is universally true of nature.
* More applicable to immediate circumstances - warm coats are in fashion in some countries for a few weeks or months when the weather is colder, and then out of fashion again as the weather warms up.
* Where innovation and experimentation are easier - a new technique, a new textile, a new machine can be tried out without changing government, or culture.
* Small drivers of change for lower layers - with decreasing influence the lower down the stack you go.
* Stabilised by lower layers - the bounds of what can be in fashion are set by the lower layers.

A data standard and its tooling can be positioned using pace layering, and this positioning allows us to understand the expected properties of the standard, as well as what else is required around it in order for it to be impactful.

Typically, a data standard can be positioned in pace layers by the concept that it’s modelling - if the concept changes rapidly, then it should be further up. The further up the layers the standard is, the more it will benefit from the use of standards that model concepts further down in order to help to stabilise it. Conversely, standards that are lower down will often need to be adapted or put to use by models further up in order to be meaningful.

For example, ISO 8601 is a standard that describes how to model dates using the Gregorian calendar. Calendars typically change over multi-century timescales, so it’s clearly low down on the layers. A date always needs context in order to mean anything - and so ISO 8601 is usually used by other standards to describe when a particular thing happens.
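
A small illustration of this layering in code; the event structure is hypothetical, and only the date-time format comes from ISO 8601:

```python
from datetime import datetime, timezone

# ISO 8601 on its own is just a stable way of writing down a point in time...
start = datetime(2021, 5, 3, 9, 0, tzinfo=timezone.utc)
print(start.isoformat())  # 2021-05-03T09:00:00+00:00

# ...it only becomes meaningful when a faster-moving, higher-layer model
# says what that point in time is for.
event = {
    "name": "Weekly lecture",
    "startDate": start.isoformat(),
}
```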

(Image reproduced from https://blog.longnow.org/02015/01/27/stewart-brand-pace-layers-thinking-at-the-interval/ under license CC BY-SA 3.0)
Binary file added docs/primers/pace_layering.png
15 changes: 15 additions & 0 deletions docs/primers/software_lifecycle.md
Software Lifecycle
==================

There is no single, universally agreed set of terms for describing how well-developed a piece of software is. Some organisations, such as the UK's Government Digital Service (GDS), have defined a lifecycle using common terms, while many are content to use terms quite loosely.

There are two common sets of terminology, which are used interchangeably at times.

## Alpha / Beta / Release Candidate / Release or Live

This set of terminology is normally used where there’s a well-defined product that’s being built - all experimentation and discovery is focussed on the details, rather than the fundamentals. The Alpha stage will normally be where the high-level architecture is defined and tested, and any particularly challenging technical problems will be identified and solutions tried. Beta will be when the software is largely ready to use, and it’s tried out by its intended users to identify if there are any parts that don’t work or are confusing. Release Candidate stage code should be ready to go, but in recognition that there are often last-minute problems that crop up, software often undergoes multiple RC rounds before being released. Any feature requests identified at beta stage or later are put aside for the next round of development, while bugs identified at RC stage may be dealt with straight away or left for later, depending on severity.

## Discovery / Prototype / MVP / Iteration

This set of terminology is usually used when there’s a well-defined problem to be solved, but multiple solutions might be acceptable. Discovery is when initial understanding of the problem is developed, and Prototypes are developed to try out specific ideas, to see if potential solutions might work. Using the learning from prototypes, an MVP can be developed to validate the solution further, which can then be iterated on to improve it (usually, using further discovery and prototyping to understand potential improvements).

20 changes: 20 additions & 0 deletions docs/primers/tooling_in_open_data_ecosystems.md
Tooling in Open Data Ecosystems
===============================

Open data ecosystems usually develop a range of tools to help to advocate for more or better publication of data, to help promote use of the data, and to ease the technical processes of working with the data.

Advocacy tools exist primarily to help convince someone of something. In open data, that’s often to demonstrate to potential publishers why they should publish, to existing publishers why they should improve their publication, or to potential users of the data what it might be able to do for them.

Examples include 360Giving GrantNav, which as well as being a useful tool in its own right is a potent advocacy tool for publishers - they’re proud to see their data appear the next day in a well-known and well-respected tool, and the absence of an organisation from the list can be a source of mild embarrassment.

Demonstration tools exist to stimulate innovation and to encourage people to think about what might be possible with the data. They can either offer incentives for interesting use of real existing data (which also then leads to valuable insight about the challenges of working with the real data), or use fictitious data to demonstrate what would be possible if the data existed, or was of a certain quality. There’s some overlap with advocacy tools, although demonstration tools are usually less directly targeted at particular groups of (potential) users.

Examples include “proof of concept” tools, data use challenges, and hack-and-learn events.

Technical tools exist to reduce the costs that anyone wanting to work with the data might incur; this is particularly valuable where a whole sector can work in the same way, or where new users can be helped to very quickly understand the possibilities of the data.

Examples include validators, aggregators, converters and visualisation tools.
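
A toy sketch of the kind of check a validator might perform; the required fields are invented for illustration rather than drawn from any real standard:

```python
REQUIRED_FIELDS = ["id", "name", "startDate"]  # hypothetical required fields

def validate(record):
    """Return a list of human-readable problems with a single data record."""
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in record or record[field] in (None, ""):
            problems.append(f"Missing required field: {field}")
    return problems

print(validate({"id": "act-1", "name": "Junior swimming"}))
# ['Missing required field: startDate']
```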

Data infrastructure comprises the tools and services that are required for an ecosystem to continue to operate. There’s often a lot of overlap with technical tools - an instance of a technical tool that’s run for the continued benefit of the community is often part of the data infrastructure.

Examples include registries, online conversion/validation tools and datastores.