Commit 36713a3

committed
Improved wording and tried to make language more generic
1 parent ee2369a commit 36713a3

1 file changed

+20
-18
lines changed

csv-schema-1.0.html

Lines changed: 20 additions & 18 deletions
@@ -18,7 +18,7 @@
1818
subtitle : "A Language for Defining and Validating CSV Data",
1919

2020
// if you wish the publication date to be other than today, set this
21-
publishDate: "2014-06-01",
21+
publishDate: "2014-07-11",
2222

2323
// if the specification's copyright date is a range of years, specify
2424
// the start date here:
@@ -174,46 +174,48 @@
174174
such as the <a href="http://w3.org">W3C</a>.
175175
</section>
176176
<section id='abstract'>
177-
<acronym title="Comma Separated Value">CSV</acronym> data comes in many shapes and sizes. Apart from [[RFC4180]] which is fairly recent and frequently ignored
178-
there is a lack of formal definition as to CSV data formats and in many ways this is one of its strengths.
177+
<acronym title="Comma Separated Value">CSV</acronym> (Comma Separated Value) data comes in many shapes and sizes. Apart from [[RFC4180]] which is a fairly recent development (and often ignored),
178+
there is a lack of formal definition as to CSV data formats, although in many ways this is one of the strengths of the CSV data format.
179179
However, extracting structured information from CSV data for further processing or storage
180180
can prove difficult if the CSV data is not well understood or perhaps not even uniform. CSV Schema
181181
defines a textual language which can be used to define the data structure, types and rules for
182-
particular CSV data. In parallel, a tool implementing this CSV Schema Language has been developed,
183-
see <a href="http://digital-preservation.github.io/csv-validator/">CSV Schema and Validator</a>
182+
CSV data formats.
184183
</section>
185184

186185
<section id="introduction" class='informative'>
187186
<h1>Introduction</h1>
188-
<p>The intention of this document is twofold:</p>
187+
<p>The intention of this document is two-fold:</p>
189188
<ol>
190189
<li>To be informative to users who are writing CSV Schemas, and provide a reference to the available syntax and functions.</li>
191-
<li>To provide enough detail such that anyone with sufficient technical expertise should be able to implement a CSV Schema parser and/or CSV validator following the rules defined in a CSV Schema.</li>
190+
<li>To provide enough detail such that anyone with sufficient technical expertise should be able to implement a CSV Schema parser and/or CSV validator evaluating the rules defined in a CSV Schema.</li>
192191
</ol>
193192
<section id="background">
194193
<h2>Background</h2>
195194
<p>
196-
The National Archives Digital Repository Infrastructure system archives digitised and born-digital materials provided by <acronym title="Other Governmental Department">OGD</acronym>s
197-
and occasionally <acronym title="Non Governmental Organisation">NGO</acronym>s. For the purposes of Digital Preservation the system processes and archives large amounts of metadata, much
195+
The National Archives <acronym title="Digital Repository Infrastructure">DRI</acronym> (Digital Repository Infrastructure) system archives digitised and born-digital materials provided by <acronym title="Other Governmental Department">OGD</acronym>s (Other Government Departments)
196+
and occasionally <acronym title="Non Governmental Organisation">NGO</acronym>s (Non-Governmental Organisations). For the purposes of Digital Preservation the system processes and archives large amounts of metadata, much
198197
of this metadata is created by the supplying organisation or by transcription. The metadata is further processed, and ultimately stored both online in an
199198
<acronym title="Resource Description Format">RDF</acronym> Triplestore and a majority subset archived in a non-RDF <acronym title="eXtensible Markup Language">XML</acronym> format.
200199
However it was recognised that the creation of XML or RDF metadata by the supplier
201200
was most likely unrealistic for either technical or financial reasons. As such, CSV was recognised as a simple data format that is human readable (to a degree), that almost anyone could create
202-
simply; effectively CSV is the lowest common denominator in structured data formats.
201+
simply; CSV is the <em>lowest common denominator</em> of structured data formats.
203202
</p>
204203
<p>
205204
The National Archives have strict rules about various CSV file formats that they expect, and how the data in those file formats should be set out. To ensure the quality of their archival metadata
206-
it was recognised that CSV files would have to be validated, as such a general schema language for CSV was developed alongside a validation tool. For details of this tool,
207-
see GitHub pages, <a href="http://digital-preservation.github.io/csv-validator/">CSV Schema and Validator</a>
205+
it was recognised that CSV files would have to be validated. It was recognised that development of a schema language for CSV (and associated tools) would be of great benefit. It was
206+
also further recognised that a general CSV Schema language would be of greater benefit if it was made publicly available and invited collaboration from other organisations and
207+
individuals; the problem of CSV data formats is certainly not unique to The National Archives.
208208
</p>
209-
</section>
209+
<p>CSV Schema is a standard currently guided by The National Archives, but developed in an open source collaborative manner that invites collaboration and contributions from all interested parties.</p>
210+
<p>A reference implementation has been created to prove the standard: The open source <a href="http://digital-preservation.github.io/csv-validator/">CSV Validator</a> application and API, offers both CSV Schema parsing and CSV file validation.</p>
211+
</section>
210212
<section id="principles">
211213
<h2>Guiding Principles</h2>
212214
<p>The design of the CSV Schema language has been influenced by a few guiding principles, understanding these will help you to understand how and why it is structured the way that it is.</p>
213215
<ul>
214216
<li>
215217
<div class="principle">Simplicity</div>
216-
<p>The language should be expressible in plain text and should be simple enough that archival domain experts could easily write it without having to know a programming language or data/document modelling language such as XML or RDF.</p>
218+
<p>The language should be expressible in plain text and should be simple enough that non-technical domain experts could easily write it without having to know a programming language or data/document modelling language such as XML, JSON or RDF.</p>
217219
<p><strong>Note</strong>, the CSV Schema Language is NOT itself expressed in CSV, it is expressed in a simple text format.</p>
218220
</li>
219221
<li>
@@ -222,7 +224,7 @@ <h2>Guiding Principles</h2>
222224
</li>
223225
<li>
224226
<div class="principle">Stream Processing</div>
225-
<p>Metadata files may be very large as such the CSV Schema Language was designed with concern for implementation of a validation tool which could read and process CSV data as a stream. Few operations require mnenomization of data from the CSV file, and where they do this is limited and should be optimisable to keep memory use to a minimum.</p>
227+
<p>CSV files may be very large and so the CSV Schema Language was designed with concern for implementations, that although not required by the specification, MAY wish to read and process CSV data as a stream. Few operations require mnenomization of data from the CSV file, and where they do this is limited and should be optimisable to keep memory use to a minimum.</p>
226228
</li>
227229
<li>
228230
<div class="principle">Sane Defaults</div>
@@ -241,7 +243,7 @@ <h1>Basics</h1>
241243
A CSV Schema is really a rules based language which defines how data in each cell should be formatted.
242244
Rules are expressed per-column of the CSV data. Rules are evaluated for each row in the CSV data.
243245
A column rule may express constraints based on the content of other columns in the same row, however at present there is no scope for looking forward or backward through rows directly.
244-
However, it possible to check that a cell entry is unique within that column in the CSV file (or that the value of a combination of cells is unique)
246+
However, it is possible to check that a cell entry is unique within that column in the CSV file (or that the value of a combination of cells is unique)
245247
</p>
246248
<p>A CSV Schema is made up of two main parts:</p>
247249
<ol class="nested">
@@ -766,7 +768,7 @@ <h1>Column Validation Expressions</h1>
766768
then <code>Mr</code> would be regarded as invalid (strictly speaking that would also require the use of an <a>Explicit Context Expression</a> to refer to the other column,
767769
but that is a subexpression of the Non Conditional Expression class).
768770
</p>
769-
<p><b>NOTE</b> To increase control over expression applicability and to avoiding creating a <a href="https://en.wikipedia.org/wiki/Left_recursion">left-recursive</a> grammar (which could lead to problems for various parser implementations),
771+
<p><strong>NOTE</strong> To increase control over expression applicability and to avoiding creating a <a href="https://en.wikipedia.org/wiki/Left_recursion">left-recursive</a> grammar (which could lead to problems for various parser implementations),
770772
<a title="Column Validation Expression">Column Validation Expressions</a> have been further split into <a title="Combinatorial Expression">Combinatorial Expressions</a> and <a title="Non Combinatorial Expression">Non Combinatorial Expressions</a>.</p>
771773
<table class="ebnf-table">
772774
<tr>
@@ -1658,7 +1660,7 @@ <h2>Validation Errors</h2>
16581660
<p>
16591661
If column data does not validate successfully against a <a title="Column Rules">Column Rule</a>, an implementation SHOULD report a <dfn>Validation Error</dfn>.
16601662
It is implementation defined whether a Validation Error terminates execution, or whether execution continues. If execution continues, any further errors SHOULD be reported.</p>
1661-
<p><b>NOTE</b> The <a>Warning Directive</a> may be used within a Column Rule to specify that what would normally be a Validation Error should be
1663+
<p><strong>NOTE</strong> The <a>Warning Directive</a> may be used within a Column Rule to specify that what would normally be a Validation Error should be
16621664
treated only as a <a>Validation Warning</a>.
16631665
</p>
16641666
<p>
