update of documentation

douwe · douwe · commit a5cee54fb642 · 2025-04-04T09:59:28.000+02:00
diff --git a/vignettes/writing_templates_and_data_guides.Rmd b/vignettes/writing_templates_and_data_guides.Rmd
@@ -20,42 +20,43 @@ library(excelDataGuide)
 
 ## Introduction
 
-Spreadsheets are a widely used tool in the biochemical laboratory, both to 
-record and to analyze experiments. When such experiments become a routine we 
-often create spreadsheet templates to save time and to structure our work. When
-analyzing a lot of these experiments switching to a scripting language like R or
-Python for analysis will become useful. Also in these cases the spreadsheet template
-is a useful way to structure the recording of experimental data and metadata.
-
-The goal of the excelDataGuide package is to be able to use both Excel-compliant 
-spreadsheets and scripts as data analysis tools. Clearly, a scripting language 
+The spreadsheet is a widely used tool in the biochemical laboratory, both for 
+recording and analyzing experiments. When such experiments become a routine we 
+often create spreadsheet templates to save time and to structure our work.
+
+The goal of the excelDataGuide package is to be able to use Excel spreadsheets 
+as well as scripts as data analysis tools. Clearly, a scripting language 
 has more potential when it comes to analyzing large data sets, consisting of 
 multiple notebooks.
 
 Importantly, **the source of all data is the spreadsheet.** This concerns 
 metadata, parameters like acceptance criteria, concentrations and measured data.
-This guarantees that calculations in the spreadsheet and in the scripts are all 
-based on the same underlying data.
-
-Concerning calculated data it may or may not be useful to let the spreadsheet be
-the source of such data for the script as well. This may be particularly useful 
-when it concerns calculations that are carried out automatically upon entry
-of data by the user.
+This *one-source* policy guarantees that calculations in the spreadsheet and in
+the scripts are all based on the same underlying data and parameters.
 
-Part of these data, like acceptance criteria, is determined in the SOP and fixed
+Parameters, like acceptance criteria, are determined in the SOP and fixed
 in the spreadsheet template, whereas other data may vary per experiment and is
 entered by the user. For example, when a user performs parameter fitting, it may
 be useful to compare the fitted parameters to those obtained in another 
 programming environment.
 
-## Writing a template
+Concerning calculated data it may or may not be useful to let the spreadsheet be
+the source of such data for the script as well. This may be particularly useful 
+when it concerns calculations that are carried out automatically upon entry
+of data by the user.
+
+## Structuring a template
 
 Below is an example of the front page of a template (of the fitc-t4 TTR assay),
 illustrating a number of ideas and concepts that we discuss below.
 
 ![front page](images/template_frontpage.png){width=100%}
 
-### A template has a version number
+### A template must have a version number
+
+Unique template version numbers are a way to prevent misunderstandings 
+between users and are also needed here to check whether a data guide is 
+compatible with the template version. 
 
 #### Version numbering rules
 
@@ -71,43 +72,62 @@ is recorded should be *text*, and not *general* or *number*
 
 #### A template name is optional 
 
-Preferably, a template also has a name. Note that the example above doesn't have
-a name.
+Preferably, a template also has a name. Note that the example in the figure 
+above doesn't have a name.
 
 #### Checking compatibilty of template versions and a guide version
 
-We use template version numbers to check compatibility with a guide. That is 
-because the same guide could in principle be used for multiple versions of a 
-template, for example because only explanatory texts or calculations have 
-changed but not locations of data. When checking version compatibility we 
+We use template version numbers to check compatibility with a guide. In principle
+the same guide can be used for multiple versions of a template as long as the 
+locations and names of variables indexed in the guide did not change. This is the 
+case when, for example, only explanatory texts or calculations or data validity
+checks have changed in the template. When checking version compatibility we 
 assume that a guide is compatible with a consecutive range of template versions
 between a minimal and a maximal version number.
 
 ### All cells are protected except those for data entry
 
-Data entry cells have a distinct background color
+Data entry cells have a distinct background color, here "marker yellow". All 
+other cells have protected status to prevent users from inadvertently changing
+them.
 
 ### Include comments
 
 Refer to the SOP+ version 
 
 ### Built-in data entry checks
 
+The validity of data entered by the users should be checked by validity checks, 
+especially when misunderstandings are likely to happen. The validity checking
+capability by excel is limited. In cases where the data structure can not be 
+properly described by a validity rule we add a comment next to the cell in which
+the data is entered.
+
 ### A single source of parameters
 
 ![The parameters as key-value pairs](images/parameters.png){width=35%}
 
+Parameters needed for calculations, for example for acceptance criteria of 
+measurements are best entered on a separate sheet, and referred to by absolute
+references in calculations. In the case of the example we have a separate 
+hidden sheet called *_parameters* for this purpose. The information in this 
+sheet is indexed in the data guide, and therefore available to R-scripts as 
+well.
+
 ### Use of hidden worksheets for data transfer
 
 ![A hidden sheet with links to plate-formetted data](images/data.png){width=100%}
 
+
+
 ## What else?
 
 To facilitate automatic reading from the spreadsheet by scripts data must be 
 in either of these four formats:
 
 - **keyvalue** format. Here, the key and value are placed in horizontally 
-adjacent cells (columns). The key is to be used as the parameter name in the
+adjacent cells (columns). The key, or its translated short name (see below)
+is to be used as the parameter name in the
 scripts and should conform to variable naming rules for the scripting language
 used. The key is found in the left-most cell of a cell range. The value can be a
 single value (one cell) or a vector of values (multiple cells).
@@ -127,25 +147,22 @@ data guide.
  
 
 The keyvalue format will be mostly used for metadata and parameters. All keyvalue 
-will be aggregated in a single named list caled "keyvalue".
+will be aggregated in a single named list called "keyvalue".
 
 The platedata format will be used for measured data and data concerning 
 concentrations in the plate wells. All ranges will be aggregated in a single 
 data frame with reported variables as column names, including the column names 
 "row" and "col", corresponding to the row and column names of the plate.
 
-Clearly, to make sure that calculations made in the spreadsheet and in the 
-script use the same values, the spreadsheet should use parameter values *etc.*
-by the (preferably absolute) cell-reference mechanism, whereas the script should
-use these values by reference to their variable names.
+## Constructing a guide
 
-Every spreadsheet template should be accompanied by a guide indicating the 
-sheets and ranges in which keyvalue and platedata formatted data are to be found
-in the filled out template. This guide is a yaml file.
+Every spreadsheet template should be accompanied by a data guide, and index 
+registering the location of different data structures in the template. This
+guide is a yaml file, a human editable and computer readable file format.
 
-This guide is structured as follows:
+Below is an example of the first rows of a data guide:
 
-```{yaml}
+``` yaml
 guide.version: '1.0'
 template.name: competition
 template.min.version: '9.3'
@@ -162,80 +179,30 @@ locations:
   - sheet: description
     type: keyvalue
     translate: true
+    atomicclass:
+      - character
+      - character
+      - character
+      - character
+      - character
+      - date
+      - character
+      - numeric
+      - character
+      - numeric
+      - character
+      - numeric
+      - character
+      - character
     varname: metadata
     ranges:
-      - A10:B14
-      - A16:B16
-      - A18:B18
-      - A20:B20
+      - A10:B21
       - A24:B25
-  - sheet: concentration response
-    type: table
-    translate: false
-    varname: userresults
-    atomicclass: numeric
-    ranges:
-      - J3:M5
-  - sheet: BGfluo
-    type: cells
-    varname: userchecks
-    translate: false
-    atomicclass: numeric
-    variables:
-      - name: spread.itm1
-        cell: G6
-      - name: spread.itm2
-        cell: G33
-translations:
-  - long: Version
-    short: template.version
-  - long: Template Name
-    short: template.name
-  - long: Study identifier
-    short: studyID
-  - long: Experiment identifier
-    short: exptID
 # remainder not shown
 ```
 
 A guide must contain the following elements:
 
-- **template.name**: the name of the data reporting template.
-- **template.version**: the version of the data reporting template.
-- **plate.format**: the format of the microplate used in the experiment (valid values are 24, 48, 96 and 384).
-- **locations**: a list of locations in the spreadsheet where data are to be found. Each location is a list of elements.
-- **translations**: a list of translations between long and short names for variables.
-
-The location data indicate where data are to be found, whereas the translation 
-part contains translations between long and short names for variables. Short 
-names are used as variable names in the scripts, whereas long names may be used 
-in the spreadsheet, in particular when these are visible to the user. In that 
-case the names should be translated before using them in the script. Reverse 
-translations may be used by the script in the output document.
-
-Required elements in a location are:
-
-- **sheet**: the name of the sheet in which the data are to be found.
-- **type**: the format of the data in the range.
-- **translate**: (*true*, *false*) whether the variable names should be translated before use in the script.
-- **varname**: the name of the variable in which the data will be available in the script.
-- **ranges**: the ranges in which the data are to be found.
-
-Furthermore, an optional element **atomicclass** can be provided which can have
-values "character", "numeric", "integer" or "logical". By default, values are
-converted to character, but if desired otherwise as indicated by the 
-**atomicclass** element then values are coerces. Note that coercion is performed
-by the functions `as.character`, `as.numeric`, `as.integer` and `as.logical` 
-respectively.
-
-The version should correspond to that reported in the template itself, 
-otherwise the file or the template is invalid. The user of this package 
-should take care of this check.
-
-
-
-## Writing a guide
-
 ### Required elements
 
 -  <kbd>guide.version</kbd>: the version of the guide
@@ -259,9 +226,34 @@ should take care of this check.
    to check the correctness of dimensions of the ranges of **platedata** 
    elements.
 
+The elements in **locations** indicate where data are to be found, whereas the translation 
+part contains translations between long and short names for variables. Short 
+names are used as variable names in the scripts, whereas long names may be used 
+in the spreadsheet, in particular when these are visible to the user. In that 
+case the names should be translated before using them in the script. Reverse 
+translations may be used by the script in the output document.
+
+## Locations
+
+### Required elements
+
+- <kbd>sheet</kbd>: the name of the sheet in which the data are to be found.
+- <kbd>type</kbd>: the format of the data in the range.
+- <kbd>translate</kbd>: (*true*, *false*) whether the variable names should be translated before use in the script.
+- <kbd>varname</kbd>: the name of the variable in which the data will be available in the script.
+- <kbd>ranges</kbd>: an array of ranges in which the data are to be found.
+
+### Optional element
+
+- <kbd>atomicclass</kbd>: the class of the data in the ranges, Can have values "character", 
+  "numeric", "integer", "logical" or "date"., It can.be either a singleton or an array of class of the same 
+  length as the number of ranges. If a singleton then by default all values are converted to character. 
+  If an atomicclass is given then values are coerced. Coercion is performed by the functions `as.character`, `as.numeric`,
+  `as.integer`, `as.logical`, respectively, or in case of a date, by a function that produces a Date object.
+
 ### Checking against the excelDataGuide json schema
 
-Correctness of the structure of a YAML file like a data guide can be 
+Correctness of the structure and syntax of a YAML file like a data guide can be 
 checked against a JSON schema (See [json-schema-everywhere](https://json-schema-everywhere.github.io/yaml)). 
 We provide a JSON schema called <kbd>excelguide_schema.json</kbd> in the folder
 <kbd>data-raw</kbd>. We use the [Polyglottal JSON Schema Validator](https://www.npmjs.com/package/pajv)