csv-schema-1.0.html
20 additions & 18 deletions
Original file line number
Diff line number
Diff line change
@@ -18,7 +18,7 @@
18
18
subtitle : "A Language for Defining and Validating CSV Data",
19
19
20
20
// if you wish the publication date to be other than today, set this
21
-
publishDate: "2014-06-01",
21
+
publishDate: "2014-07-11",
22
22
23
23
// if the specification's copyright date is a range of years, specify
24
24
// the start date here:
@@ -174,46 +174,48 @@
174
174
such as the <a href="http://w3.org">W3C</a>.
175
175
</section>
176
176
<section id='abstract'>
177
-
<acronym title="Comma Separated Value">CSV</acronym> data comes in many shapes and sizes. Apart from [[RFC4180]] which is fairly recent and frequently ignored
178
-
there is a lack of formal definition as to CSV data formats and in many ways this is one of its strengths.
177
+
<acronym title="Comma Separated Value">CSV</acronym> (Comma Separated Value) data comes in many shapes and sizes. Apart from [[RFC4180]] which is a fairly recent development (and often ignored),
178
+
there is a lack of formal definition as to CSV data formats, although in many ways this is one of the strengths of the CSV data format.
179
179
However, extracting structured information from CSV data for further processing or storage
180
180
can prove difficult if the CSV data is not well understood or perhaps not even uniform. CSV Schema
181
181
defines a textual language which can be used to define the data structure, types and rules for
182
-
particular CSV data. In parallel, a tool implementing this CSV Schema Language has been developed,
183
-
see <a href="http://digital-preservation.github.io/csv-validator/">CSV Schema and Validator</a>
182
+
CSV data formats.
184
183
</section>
185
184
186
185
<section id="introduction" class='informative'>
187
186
<h1>Introduction</h1>
188
-
<p>The intention of this document is twofold:</p>
187
+
<p>The intention of this document is two-fold:</p>
189
188
<ol>
190
189
<li>To be informative to users who are writing CSV Schemas, and provide a reference to the available syntax and functions.</li>
191
-
<li>To provide enough detail such that anyone with sufficient technical expertise should be able to implement a CSV Schema parser and/or CSV validator following the rules defined in a CSV Schema.</li>
190
+
<li>To provide enough detail such that anyone with sufficient technical expertise should be able to implement a CSV Schema parser and/or CSV validator evaluating the rules defined in a CSV Schema.</li>
192
191
</ol>
193
192
<section id="background">
194
193
<h2>Background</h2>
195
194
<p>
196
-
The National Archives Digital Repository Infrastructure system archives digitised and born-digital materials provided by <acronym title="Other Governmental Department">OGD</acronym>s
197
-
and occasionally <acronym title="Non Governmental Organisation">NGO</acronym>s. For the purposes of Digital Preservation the system processes and archives large amounts of metadata, much
195
+
The National Archives <acronym title="Digital Repository Infrastructure">DRI</acronym> (Digital Repository Infrastructure) system archives digitised and born-digital materials provided by <acronym title="Other Governmental Department">OGD</acronym>s (Other Government Departments)
196
+
and occasionally <acronym title="Non Governmental Organisation">NGO</acronym>s (Non-Governmental Organisations). For the purposes of Digital Preservation the system processes and archives large amounts of metadata, much
198
197
of this metadata is created by the supplying organisation or by transcription. The metadata is further processed, and ultimately stored both online in an
199
198
<acronym title="Resource Description Format">RDF</acronym> Triplestore and a majority subset archived in a non-RDF <acronym title="eXtensible Markup Language">XML</acronym> format.
200
199
However it was recognised that the creation of XML or RDF metadata by the supplier
201
200
was most likely unrealistic for either technical or financial reasons. As such, CSV was recognised as a simple data format that is human readable (to a degree), that almost anyone could create
202
-
simply; effectively CSV is the lowest common denominator in structured data formats.
201
+
simply; CSV is the <em>lowest common denominator</em> of structured data formats.
203
202
</p>
204
203
<p>
205
204
The National Archives have strict rules about various CSV file formats that they expect, and how the data in those file formats should be set out. To ensure the quality of their archival metadata
206
-
it was recognised that CSV files would have to be validated, as such a general schema language for CSV was developed alongside a validation tool. For details of this tool,
207
-
see GitHub pages, <a href="http://digital-preservation.github.io/csv-validator/">CSV Schema and Validator</a>
205
+
it was recognised that CSV files would have to be validated. It was recognised that development of a schema language for CSV (and associated tools) would be of great benefit. It was
206
+
also further recognised that a general CSV Schema language would be of greater benefit if it was made publicly available and invited collaboration from other organisations and
207
+
individuals; the problem of CSV data formats is certainly not unique to The National Archives.
208
208
</p>
209
-
</section>
209
+
<p>CSV Schema is a standard currently guided by The National Archives, but developed in an open source collaborative manner that invites collaboration and contributions from all interested parties.</p>
210
+
<p>A reference implementation has been created to prove the standard: The open source <a href="http://digital-preservation.github.io/csv-validator/">CSV Validator</a> application and API offers both CSV Schema parsing and CSV file validation.</p>
211
+
</section>
210
212
<section id="principles">
211
213
<h2>Guiding Principles</h2>
212
214
<p>The design of the CSV Schema language has been influenced by a few guiding principles, understanding these will help you to understand how and why it is structured the way that it is.</p>
213
215
<ul>
214
216
<li>
215
217
<div class="principle">Simplicity</div>
216
-
<p>The language should be expressible in plain text and should be simple enough that archival domain experts could easily write it without having to know a programming language or data/document modelling language such as XML or RDF.</p>
218
+
<p>The language should be expressible in plain text and should be simple enough that non-technical domain experts could easily write it without having to know a programming language or data/document modelling language such as XML, JSON or RDF.</p>
217
219
<p><strong>Note</strong>, the CSV Schema Language is NOT itself expressed in CSV, it is expressed in a simple text format.</p>
218
220
</li>
219
221
<li>
@@ -222,7 +224,7 @@ <h2>Guiding Principles</h2>
222
224
</li>
223
225
<li>
224
226
<div class="principle">Stream Processing</div>
225
-
<p>Metadata files may be very large as such the CSV Schema Language was designed with concern for implementation of a validation tool which could read and process CSV data as a stream. Few operations require mnenomization of data from the CSV file, and where they do this is limited and should be optimisable to keep memory use to a minimum.</p>
227
+
<p>CSV files may be very large and so the CSV Schema Language was designed with concern for implementations, that although not required by the specification, MAY wish to read and process CSV data as a stream. Few operations require mnenomization of data from the CSV file, and where they do this is limited and should be optimisable to keep memory use to a minimum.</p>
226
228
</li>
227
229
<li>
228
230
<div class="principle">Sane Defaults</div>
@@ -241,7 +243,7 @@ <h1>Basics</h1>
241
243
A CSV Schema is really a rules based language which defines how data in each cell should be formatted.
242
244
Rules are expressed per-column of the CSV data. Rules are evaluated for each row in the CSV data.
243
245
A column rule may express constraints based on the content of other columns in the same row, however at present there is no scope for looking forward or backward through rows directly.
244
-
However, it possible to check that a cell entry is unique within that column in the CSV file (or that the value of a combination of cells is unique)
246
+
However, it is possible to check that a cell entry is unique within that column in the CSV file (or that the value of a combination of cells is unique)
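The evaluation model described in this hunk (rules attached to columns, evaluated once per row, optionally consulting other columns of the same row, plus a uniqueness check across the file) can be sketched roughly as follows. This is an illustrative Python sketch only, not CSV Schema syntax and not the reference validator's API; the names `validate`, `Rule`, `unique_columns` and the sample columns are all hypothetical.

```python
import csv
import io
from typing import Callable, Dict, Iterable, Iterator, Sequence, Tuple

# Hypothetical rule type: a rule sees its own cell value and the whole
# current row, so it may refer to other columns in the same row.
Rule = Callable[[str, Dict[str, str]], bool]


def validate(
    rows: Iterable[Dict[str, str]],
    rules: Dict[str, Rule],
    unique_columns: Sequence[str] = (),
) -> Iterator[Tuple[int, str, str]]:
    """Yield (row number, column, message) for each failed check.

    Rows are consumed one at a time; the only state kept between rows is
    the set of values already seen in each column that must be unique.
    """
    seen = {column: set() for column in unique_columns}
    for row_number, row in enumerate(rows, start=1):
        # Per-column rules, evaluated for the current row only.
        for column, rule in rules.items():
            if not rule(row[column], row):
                yield (row_number, column, "rule failed")
        # Uniqueness is the one check that needs memory of earlier rows.
        for column in unique_columns:
            value = row[column]
            if value in seen[column]:
                yield (row_number, column, "duplicate value")
            seen[column].add(value)


# Made-up sample data and rules, purely for illustration.
data = io.StringIO("id,title,gender\n1,Mr,m\n2,Mr,f\n2,Ms,f\n")
rules = {
    "id": lambda value, row: value.isdigit(),
    # A rule that looks at another column in the same row.
    "title": lambda value, row: value != "Mr" or row["gender"] == "m",
}
errors = list(validate(csv.DictReader(data), rules, unique_columns=("id",)))
```

Because the sketch is a generator over rows, it never looks forward or backward through the file; uniqueness is handled by remembering only the values already seen, which is one plausible way to keep memory use small.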
then <code>Mr</code> would be regarded as invalid (strictly speaking that would also require the use of an <a>Explicit Context Expression</a> to refer to the other column,
767
769
but that is a subexpression of the Non Conditional Expression class).
768
770
</p>
769
-
<p><b>NOTE</b> To increase control over expression applicability and to avoiding creating a <a href="https://en.wikipedia.org/wiki/Left_recursion">left-recursive</a> grammar (which could lead to problems for various parser implementations),
771
+
<p><strong>NOTE</strong> To increase control over expression applicability and to avoiding creating a <a href="https://en.wikipedia.org/wiki/Left_recursion">left-recursive</a> grammar (which could lead to problems for various parser implementations),
770
772
<a title="Column Validation Expression">Column Validation Expressions</a> have been further split into <a title="Combinatorial Expression">Combinatorial Expressions</a> and <a title="Non Combinatorial Expression">Non Combinatorial Expressions</a>.</p>
771
773
<table class="ebnf-table">
772
774
<tr>
@@ -1658,7 +1660,7 @@ <h2>Validation Errors</h2>
1658
1660
<p>
1659
1661
If column data does not validate successfully against a <a title="Column Rules">Column Rule</a>, an implementation SHOULD report a <dfn>Validation Error</dfn>.
1660
1662
It is implementation defined whether a Validation Error terminates execution, or whether execution continues. If execution continues, any further errors SHOULD be reported.</p>
1661
-
<p><b>NOTE</b> The <a>Warning Directive</a> may be used within a Column Rule to specify that what would normally be a Validation Error should be
1663
+
<p><strong>NOTE</strong> The <a>Warning Directive</a> may be used within a Column Rule to specify that what would normally be a Validation Error should be