@@ -20,42 +20,43 @@ library(excelDataGuide)
2020
2121## Introduction
2222
23- Spreadsheets are a widely used tool in the biochemical laboratory, both to
24- record and to analyze experiments. When such experiments become a routine we
25- often create spreadsheet templates to save time and to structure our work. When
26- analyzing a lot of these experiments switching to a scripting language like R or
27- Python for analysis will become useful. Also in these cases the spreadsheet template
28- is a useful way to structure the recording of experimental data and metadata.
29-
30- The goal of the excelDataGuide package is to be able to use both Excel-compliant
31- spreadsheets and scripts as data analysis tools. Clearly, a scripting language
23+ The spreadsheet is a widely used tool in the biochemical laboratory, both for
24+ recording and for analyzing experiments. When such experiments become routine, we
25+ often create spreadsheet templates to save time and to structure our work.
26+
27+ The goal of the excelDataGuide package is to be able to use Excel spreadsheets
28+ as well as scripts as data analysis tools. Clearly, a scripting language
3229has more potential when it comes to analyzing large data sets, consisting of
3330multiple notebooks.
3431
3532Importantly, **the source of all data is the spreadsheet.** This concerns
3633metadata, parameters like acceptance criteria, concentrations and measured data.
37- This guarantees that calculations in the spreadsheet and in the scripts are all
38- based on the same underlying data.
39-
40- Concerning calculated data it may or may not be useful to let the spreadsheet be
41- the source of such data for the script as well. This may be particularly useful
42- when it concerns calculations that are carried out automatically upon entry
43- of data by the user.
34+ This *one-source* policy guarantees that calculations in the spreadsheet and in
35+ the scripts are all based on the same underlying data and parameters.
4436
45- Part of these data , like acceptance criteria, is determined in the SOP and fixed
37+ Parameters, like acceptance criteria, are determined in the SOP and fixed
4638in the spreadsheet template, whereas other data may vary per experiment and is
4739entered by the user. For example, when a user performs parameter fitting, it may
4840be useful to compare the fitted parameters to those obtained in another
4941programming environment.
5042
51- ## Writing a template
43+ Concerning calculated data it may or may not be useful to let the spreadsheet be
44+ the source of such data for the script as well. This may be particularly useful
45+ when it concerns calculations that are carried out automatically upon entry
46+ of data by the user.
47+
48+ ## Structuring a template
5249
5350Below is an example of the front page of a template (of the fitc-t4 TTR assay),
5451illustrating a number of ideas and concepts that we discuss below.
5552
5653![front page](images/template_frontpage.png){width=100%}
5754
58- ### A template has a version number
55+ ### A template must have a version number
56+
57+ Unique template version numbers are a way to prevent misunderstandings
58+ between users and are also needed here to check whether a data guide is
59+ compatible with the template version.
5960
6061#### Version numbering rules
6162
@@ -71,43 +72,62 @@ is recorded should be *text*, and not *general* or *number*
7172
7273#### A template name is optional
7374
74- Preferably, a template also has a name. Note that the example above doesn't have
75- a name.
75+ Preferably, a template also has a name. Note that the example in the figure
76+ above doesn't have a name.
7677
7778#### Checking compatibility of template versions and a guide version
7879
79- We use template version numbers to check compatibility with a guide. That is
80- because the same guide could in principle be used for multiple versions of a
81- template, for example because only explanatory texts or calculations have
82- changed but not locations of data. When checking version compatibility we
80+ We use template version numbers to check compatibility with a guide. In principle
81+ the same guide can be used for multiple versions of a template as long as the
82+ locations and names of variables indexed in the guide did not change. This is the
83+ case when, for example, only explanatory texts or calculations or data validity
84+ checks have changed in the template. When checking version compatibility we
8385assume that a guide is compatible with a consecutive range of template versions
8486between a minimal and a maximal version number.
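
The range check described above can be sketched in R as follows. This is a minimal illustration, not the package's own implementation: the function name `is_compatible` and its arguments are hypothetical, and we assume that version numbers have the form "major.minor".

```r
# Hypothetical helper: does a template version fall within the consecutive
# range of versions that a guide supports?
# Versions are compared as version numbers, not as text or decimals,
# so that "9.10" counts as later than "9.3".
is_compatible <- function(template_version, min_version, max_version) {
  v <- numeric_version(template_version)
  v >= numeric_version(min_version) && v <= numeric_version(max_version)
}

# Example calls with made-up version numbers
is_compatible("9.4", "9.3", "9.10")
is_compatible("10.0", "9.3", "9.10")
```

Comparing with `numeric_version` rather than with plain text or numbers avoids ordering mistakes such as treating 9.10 as smaller than 9.3.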
8587
8688### All cells are protected except those for data entry
8789
88- Data entry cells have a distinct background color
90+ Data entry cells have a distinct background color, here "marker yellow". All
91+ other cells have protected status to prevent users from inadvertently changing
92+ them.
8993
9094### Include comments
9195
9296Refer to the SOP+ version
9397
9498### Built-in data entry checks
9599
100+ The validity of data entered by users should be verified with built-in checks,
101+ especially when misunderstandings are likely to happen. The validity checking
102+ capability of Excel is limited. In cases where the data structure cannot be
103+ properly described by a validity rule we add a comment next to the cell in which
104+ the data is entered.
105+
96106### A single source of parameters
97107
98108![The parameters as key-value pairs](images/parameters.png){width=35%}
99109
110+ Parameters needed for calculations, for example acceptance criteria for
111+ measurements, are best entered on a separate sheet and referred to by absolute
112+ references in calculations. In the example we have a separate
113+ hidden sheet called *_parameters* for this purpose. The information in this
114+ sheet is indexed in the data guide, and therefore available to R scripts as
115+ well.
116+
100117### Use of hidden worksheets for data transfer
101118
102119![A hidden sheet with links to plate-formatted data](images/data.png){width=100%}
103120
121+
122+
104123## What else?
105124
106125To facilitate automatic reading from the spreadsheet by scripts, data must be
107126in one of these four formats:
108127
109128- **keyvalue** format. Here, the key and value are placed in horizontally
110- adjacent cells (columns). The key is to be used as the parameter name in the
129+ adjacent cells (columns). The key, or its translated short name (see below),
130+ is to be used as the parameter name in the
111131scripts and should conform to variable naming rules for the scripting language
112132used. The key is found in the left-most cell of a cell range. The value can be a
113133single value (one cell) or a vector of values (multiple cells).
@@ -127,25 +147,22 @@ data guide.
127147
128148
129149The keyvalue format will be mostly used for metadata and parameters. All keyvalue
130- will be aggregated in a single named list caled "keyvalue".
150+ will be aggregated in a single named list called "keyvalue".
131151
132152The platedata format will be used for measured data and data concerning
133153concentrations in the plate wells. All ranges will be aggregated in a single
134154data frame with reported variables as column names, including the column names
135155"row" and "col", corresponding to the row and column names of the plate.
136156
137- Clearly, to make sure that calculations made in the spreadsheet and in the
138- script use the same values, the spreadsheet should use parameter values * etc.*
139- by the (preferably absolute) cell-reference mechanism, whereas the script should
140- use these values by reference to their variable names.
157+ ## Constructing a guide
141158
142- Every spreadsheet template should be accompanied by a guide indicating the
143- sheets and ranges in which keyvalue and platedata formatted data are to be found
144- in the filled out template. This guide is a yaml file.
159+ Every spreadsheet template should be accompanied by a data guide, an index
160+ registering the location of the different data structures in the template. This
161+ guide is a YAML file, a human-editable and machine-readable file format.
145162
146- This guide is structured as follows :
163+ Below is an example of the first lines of a data guide:
147164
148- ``` { yaml}
165+ ``` yaml
149166guide.version: '1.0'
150167template.name: competition
151168template.min.version: '9.3'
@@ -162,80 +179,30 @@ locations:
162179 - sheet : description
163180 type : keyvalue
164181 translate : true
182+ atomicclass :
183+ - character
184+ - character
185+ - character
186+ - character
187+ - character
188+ - date
189+ - character
190+ - numeric
191+ - character
192+ - numeric
193+ - character
194+ - numeric
195+ - character
196+ - character
165197 varname : metadata
166198 ranges :
167- - A10:B14
168- - A16:B16
169- - A18:B18
170- - A20:B20
199+ - A10:B21
171200 - A24:B25
172- - sheet: concentration response
173- type: table
174- translate: false
175- varname: userresults
176- atomicclass: numeric
177- ranges:
178- - J3:M5
179- - sheet: BGfluo
180- type: cells
181- varname: userchecks
182- translate: false
183- atomicclass: numeric
184- variables:
185- - name: spread.itm1
186- cell: G6
187- - name: spread.itm2
188- cell: G33
189- translations:
190- - long: Version
191- short: template.version
192- - long: Template Name
193- short: template.name
194- - long: Study identifier
195- short: studyID
196- - long: Experiment identifier
197- short: exptID
198201# remainder not shown
199202```
200203
201204A guide must contain the following elements:
202205
203- - ** template.name** : the name of the data reporting template.
204- - ** template.version** : the version of the data reporting template.
205- - ** plate.format** : the format of the microplate used in the experiment (valid values are 24, 48, 96 and 384).
206- - ** locations** : a list of locations in the spreadsheet where data are to be found. Each location is a list of elements.
207- - ** translations** : a list of translations between long and short names for variables.
208-
209- The location data indicate where data are to be found, whereas the translation
210- part contains translations between long and short names for variables. Short
211- names are used as variable names in the scripts, whereas long names may be used
212- in the spreadsheet, in particular when these are visible to the user. In that
213- case the names should be translated before using them in the script. Reverse
214- translations may be used by the script in the output document.
215-
216- Required elements in a location are:
217-
218- - ** sheet** : the name of the sheet in which the data are to be found.
219- - ** type** : the format of the data in the range.
220- - ** translate** : (* true* , * false* ) whether the variable names should be translated before use in the script.
221- - ** varname** : the name of the variable in which the data will be available in the script.
222- - ** ranges** : the ranges in which the data are to be found.
223-
224- Furthermore, an optional element ** atomicclass** can be provided which can have
225- values "character", "numeric", "integer" or "logical". By default, values are
226- converted to character, but if desired otherwise as indicated by the
227- ** atomicclass** element then values are coerces. Note that coercion is performed
228- by the functions ` as.character ` , ` as.numeric ` , ` as.integer ` and ` as.logical `
229- respectively.
230-
231- The version should correspond to that reported in the template itself,
232- otherwise the file or the template is invalid. The user of this package
233- should take care of this check.
234-
235-
236-
237- ## Writing a guide
238-
239206### Required elements
240207
241208- <kbd >guide.version</kbd >: the version of the guide
@@ -259,9 +226,34 @@ should take care of this check.
259226 to check the correctness of dimensions of the ranges of ** platedata**
260227 elements.
261228
229+ The elements in **locations** indicate where data are to be found, whereas the translation
230+ part contains translations between long and short names for variables. Short
231+ names are used as variable names in the scripts, whereas long names may be used
232+ in the spreadsheet, in particular when these are visible to the user. In that
233+ case the names should be translated before using them in the script. Reverse
234+ translations may be used by the script in the output document.
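
As an illustration of the translation step, the sketch below maps long names, as shown to the user in the spreadsheet, to short names for use in a script. The helper `translate_keys` is hypothetical and not part of the package; the long/short pairs are taken from a guide example, and falling back to the original key when no translation exists is an assumption made for this sketch.

```r
# Long/short name pairs as they could appear in the translations part of a guide
translations <- data.frame(
  long  = c("Version", "Template Name", "Study identifier", "Experiment identifier"),
  short = c("template.version", "template.name", "studyID", "exptID"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace long names by short names.
# Keys without a translation are returned unchanged (an assumption).
translate_keys <- function(keys, translations) {
  idx <- match(keys, translations$long)
  ifelse(is.na(idx), keys, translations$short[idx])
}

translate_keys(c("Version", "Study identifier", "Operator"), translations)
```

A reverse translation for use in output documents would follow the same pattern, with the roles of the `long` and `short` columns swapped.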
235+
236+ ## Locations
237+
238+ ### Required elements
239+
240+ - <kbd >sheet</kbd >: the name of the sheet in which the data are to be found.
241+ - <kbd >type</kbd >: the format of the data in the range.
242+ - <kbd >translate</kbd >: (*true*, *false*) whether the variable names should be translated before use in the script.
243+ - <kbd >varname</kbd >: the name of the variable in which the data will be available in the script.
244+ - <kbd >ranges</kbd >: an array of ranges in which the data are to be found.
245+
246+ ### Optional element
247+
248+ - <kbd >atomicclass</kbd >: the class of the data in the ranges. Can have values "character",
249+ "numeric", "integer", "logical" or "date". It can be either a singleton or an array of classes of the same
250+ length as the number of ranges. By default all values are converted to character.
251+ If an atomicclass is given then values are coerced. Coercion is performed by the functions `as.character`, `as.numeric`,
252+ `as.integer`, `as.logical`, respectively, or in case of a date, by a function that produces a Date object.
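
The coercion rule can be sketched in R as below. The helper `coerce_values` is hypothetical and only illustrates the mapping from class names to coercion functions. For dates we assume here that values arrive as ISO 8601 text and use `as.Date`; dates read from an actual Excel file may need a different conversion.

```r
# Hypothetical helper: coerce raw cell values to the class named by atomicclass.
# The default class is "character", matching the default described above.
coerce_values <- function(x, atomicclass = "character") {
  switch(atomicclass,
    character = as.character(x),
    numeric   = as.numeric(x),
    integer   = as.integer(x),
    logical   = as.logical(x),
    # Assumption: dates are given as text such as "2024-01-31"
    date      = as.Date(x),
    stop("Unknown atomicclass: ", atomicclass)
  )
}

# One class per value, as when atomicclass is given as an array
values  <- list("competition", "2024-01-31", "0.25")
classes <- c("character", "date", "numeric")
metadata <- Map(coerce_values, values, classes)
```

When atomicclass is a singleton, the same function can be called once on all values with that single class name.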
253+
262254### Checking against the excelDataGuide JSON schema
263255
264- Correctness of the structure of a YAML file like a data guide can be
256+ Correctness of the structure and syntax of a YAML file like a data guide can be
265257checked against a JSON schema (see [json-schema-everywhere](https://json-schema-everywhere.github.io/yaml)).
266258We provide a JSON schema called <kbd >excelguide_schema.json</kbd > in the folder
267259<kbd >data-raw</kbd >. We use the [Polyglottal JSON Schema Validator](https://www.npmjs.com/package/pajv)