Commit e1cae21

Merge branch 'main' into next
2 parents b9e8645 + 2ff1177 commit e1cae21

File tree

10 files changed

+714
-653
lines changed

.husky/pre-push

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
npm test

.prettierignore

Lines changed: 1 addition & 0 deletions
@@ -1,2 +1,3 @@
11
content/docs/guides/csvw-data-package.md
2+
content/docs/guides/mediawiki-tabular-data.md
23
public/profiles

content/docs/guides/csvw-data-package.md

Lines changed: 6 additions & 6 deletions
@@ -86,7 +86,7 @@ Data Package can define more than groups of tables. A [package](/standard/data-p
8686

8787
### Tables
8888

89-
| CSVW property | Data package support | Details |
89+
| CSVW property | Data Package support | Details |
9090
| ---- | ---- | ---- |
9191
| [url](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#table-url) | Yes | As [resource.path](/standard/data-resource/#path-or-data) |
9292
| [dialect](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#table-dialect) | Yes | As [resource.dialect](/standard/data-resource/#dialect) |
@@ -104,7 +104,7 @@ Data Package can define more than groups of tables. A [package](/standard/data-p
104104
Data Package [Table Schema](/standard/table-schema/) has features that CSVW schema does not, including [fieldsMatch](/standard/table-schema/#fieldsMatch) for matching a schema with data, [missingValues](/standard/table-schema/#missingValues) for multiple (and labelled) missing values, and [uniqueKeys](/standard/table-schema/#uniqueKeys).
105105
:::
106106

107-
| CSVW property | Data package support | Details |
107+
| CSVW property | Data Package support | Details |
108108
| ---- | ---- | ---- |
109109
| [columns](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#schema-columns) | Yes | As [schema.fields](/standard/table-schema/#fields) |
110110
| [foreignKeys](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#schema-foreignKeys) | Yes | As [schema.foreignKeys](/standard/table-schema/#foreignKeys) |
@@ -115,7 +115,7 @@ Data Package [Table Schema](/standard/table-schema/) has features that CSVW sche
115115

116116
### Columns
117117

118-
| CSVW property | Data package support | Details |
118+
| CSVW property | Data Package support | Details |
119119
| ---- | ---- | ---- |
120120
| [name](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#column-name) | Yes | As [field.name](/standard/table-schema/#name) |
121121
| [suppressOutput](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#column-suppressOutput) | No | |
@@ -128,7 +128,7 @@ Data Package [Table Schema](/standard/table-schema/) has features that CSVW sche
128128

129129
Data Package properties do not inherit from their parent, unless otherwise specified (e.g. [resource.sources](/standard/data-resource/#sources)). The properties listed below only exist at one level in Data Package, except for `missingValues`.
130130

131-
| CSVW property | Data package support | Details |
131+
| CSVW property | Data Package support | Details |
132132
| ---- | ---- | ---- |
133133
| [aboutUrl](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#cell-aboutUrl) | Custom property | |
134134
| [datatype](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#cell-datatype) | Yes | As [field.type](/standard/table-schema/#type-and-format) |
@@ -152,7 +152,7 @@ Common properties can be added in Data Package as [custom properties](/standard/
152152
Data Package [Table Dialect](/standard/table-dialect/) was used as inspiration for CSVW dialect. It has features that CSVW dialect does not, since it covers tabular data formats beyond delimited text files, such as spreadsheets and databases. For delimited text files it supports [headerJoin](/standard/table-dialect/#headerJoin), [doubleQuote](/standard/table-dialect/#doubleQuote), [escapeChar](/standard/table-dialect/#escapeChar), and [nullSequence](/standard/table-dialect/#nullSequence), which CSVW does not.
153153
:::
154154

155-
| CSVW property | Data package support | Details |
155+
| CSVW property | Data Package support | Details |
156156
| ---- | ---- | ---- |
157157
| [commentPrefix](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#dialect-commentPrefix) | Yes | As [dialect.commentChar](/standard/table-dialect/#commentChar) |
158158
| [delimiter](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#dialect-delimiter) | Yes | As [dialect.delimiter](/standard/table-dialect/#delimiter) |
@@ -182,7 +182,7 @@ CSVW defines data types as built-in data types and derived data types. A derived
182182
Data Package [Table Schema](/standard/table-schema/) supports data types that CSVW does not, such as (labelled) [categories](/standard/table-schema/#categories) and [geojson](/standard/table-schema/#geojson). It also supports a number of constraints that CSVW does not, such as [unique](/standard/table-schema/#unique) values, [pattern](/standard/table-schema/#pattern) for regex comparison and [enum](/standard/table-schema/#enum) for controlled values, which allow rigorous data validation.
183183
:::
184184

185-
| CSVW property | Data package support | Details |
185+
| CSVW property | Data Package support | Details |
186186
| ---- | ---- | ---- |
187187
| [base](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#datatype-base) | No | All types are defined as [field.type](/standard/table-schema/#type-and-format) |
188188
| [format](https://www.w3.org/TR/2015/REC-tabular-metadata-20151217/#datatype-format) | Yes | As [field.format](/standard/table-schema/#type-and-format) |
content/docs/guides/mediawiki-tabular-data.md

Lines changed: 61 additions & 0 deletions
@@ -0,0 +1,61 @@
1+
---
2+
title: Comparison with MediaWiki Tabular Data
3+
sidebar:
4+
order: 3
5+
---
6+
7+
<table>
8+
<tr>
9+
<th>Authors</th>
10+
<td>Jakob Voß</td>
11+
</tr>
12+
</table>
13+
14+
[MediaWiki](https://www.mediawiki.org/) is the software used to run Wikipedia and related projects of the Wikimedia Foundation, including the media file repository [Wikimedia Commons](https://commons.wikimedia.org/). Commons hosts mostly images but also some records with tabular data. The [MediaWiki Tabular Data Model](https://www.mediawiki.org/wiki/Help:Tabular_data) was inspired by Data Package version 1, but it differs slightly from the current Data Package specification, as described below.
15+
16+
## Property Comparison
17+
18+
A [MediaWiki tabular data page](https://www.mediawiki.org/wiki/Help:Tabular_data) describes and contains an individual table of data, similar to a [Data Resource](/standard/data-resource/) with inline tabular data. Both are serialized as JSON objects, but the former exists as a page with a unique name in a MediaWiki instance (such as Wikimedia Commons).
19+
20+
### Top-level Properties
21+
22+
MediaWiki Tabular Data has three required and two optional top-level properties. Most of these properties map to corresponding properties of a Data Resource:
23+
24+
| MediaWiki Tabular Data | Data Package Data Resource |
25+
| ------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------- |
26+
| - (implied by page name) | [name](/standard/data-resource/#name) (required) is a string |
27+
| [description](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (optional) is a localized string | [description](/standard/data-resource/#description) (optional) is a CommonMark string |
28+
| [data](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) | [data](/standard/data-resource/#path-or-data) (optional) |
29+
| [license](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) is the string `CC0-1.0` or another known identifier | [licenses](/standard/data-resource/#licenses) (optional) is an array |
30+
| [schema](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) as [described below](#schema-properties) | [schema](/standard/data-resource/#schema) (optional) can have multiple forms |
31+
| [sources](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (optional) is a string with Wiki markup | [sources](/standard/data-resource/#sources) (optional) is an array of objects |
32+
33+
The differences are:
34+
35+
- property `name` does not exist but can be implied from the page name
36+
- properties `description` and `sources` have a different format
37+
- property `data` is always an array of arrays, and the [data types](#data-types) of individual values can differ
38+
- property `schema` is required but differs in the definition of its [schema properties](#schema-properties)
39+
- there is no property `licenses`, only `license`, which is fixed to the plain string value `CC0-1.0` (other license identifiers may be possible)
40+
41+
### Data Types
42+
43+
Tabular Data supports four data types that overlap with [Table Schema data types](/standard/table-schema/#field-types):
44+
45+
- `number` subset of Table Schema [number](/standard/table-schema/#number) (no `NaN`, `INF`, or `-INF`)
46+
- `boolean` same as Table Schema [boolean](/standard/table-schema/#boolean)
47+
- `string` subset of Table Schema [string](/standard/table-schema/#string) (limited to 400 characters at most and must not include `\n` or `\t`)
48+
- `localized` refers to an object that maps language codes to strings with the same limitations as the `string` type.
49+
This type is not supported in Table Schema.
50+
51+
Individual values in a MediaWiki Tabular Data table can always be `null`, while in Table Schema you need to explicitly list values that should be considered missing in [schema.missingValues](/standard/table-schema/#missingValues).
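
The value constraints listed above can be sketched as a small validator. This is a minimal illustration in Python, not MediaWiki's actual implementation: the function names are hypothetical, and the check on `localized` values assumes that language codes are plain strings.

```python
import math


def _is_valid_string(value):
    """A string of at most 400 characters without newline or tab."""
    return (
        isinstance(value, str)
        and len(value) <= 400
        and "\n" not in value
        and "\t" not in value
    )


def is_valid_value(value, type_name):
    """Check one cell against a MediaWiki Tabular Data type (hypothetical helper)."""
    if value is None:
        # Individual values can always be null.
        return True
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool):
            return False
        if isinstance(value, int):
            return True
        # No NaN, INF, or -INF.
        return isinstance(value, float) and math.isfinite(value)
    if type_name == "boolean":
        return isinstance(value, bool)
    if type_name == "string":
        return _is_valid_string(value)
    if type_name == "localized":
        # Assumption: keys are language codes given as plain strings.
        return isinstance(value, dict) and all(
            isinstance(code, str) and _is_valid_string(text)
            for code, text in value.items()
        )
    raise ValueError(f"unknown type: {type_name}")
```
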
52+
53+
### Schema Properties
54+
55+
The `schema` property of MediaWiki Tabular Data contains an object with the property `fields`, just like [Table Schema](/standard/table-schema/), but no other properties are allowed. Elements of the `fields` array are like Table Schema [field descriptors](/standard/table-schema/#field), limited to three properties and with different value spaces:
56+
57+
| MediaWiki Tabular Data | Data Package Table Schema |
58+
| ---------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------- |
59+
| [name](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) must be a string matching `^[a-zA-Z_][a-zA-Z_0-9]*` | [name](/standard/table-schema/#name) (required) can be any string |
60+
| [type](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (required) is one of the [Data Types above](#data-types) | [type](/standard/table-schema/#type) (optional) with [different data types](#data-types) |
61+
| [title](https://www.mediawiki.org/wiki/Help:Tabular_data#Top-level_fields) (optional) is a localized string | [title](/standard/table-schema/#title) (optional) is a plain string |

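
Taken together, the two comparison tables suggest a rough conversion from a MediaWiki tabular data page to a Data Resource. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, the resource `name` is derived from the page name by a made-up slug rule, localized strings are reduced to a single language, `sources` is stored as one object with a `title`, and the `localized` type (which has no Table Schema equivalent) is mapped to `object`.

```python
import re

# Assumption: "localized" has no Table Schema equivalent, so it is kept as "object".
TYPE_MAP = {
    "number": "number",
    "boolean": "boolean",
    "string": "string",
    "localized": "object",
}


def pick_language(value, lang="en"):
    """Reduce a localized string such as {"en": "...", "de": "..."} to one plain string."""
    if isinstance(value, dict):
        if lang in value:
            return value[lang]
        # Fall back to the first available language, if any.
        return next(iter(value.values()), None)
    return value


def to_data_resource(page_name, page, lang="en"):
    """Convert a parsed MediaWiki tabular data page (a dict) to a Data Resource dict."""
    fields = []
    for source_field in page["schema"]["fields"]:
        field = {"name": source_field["name"], "type": TYPE_MAP[source_field["type"]]}
        if "title" in source_field:
            field["title"] = pick_language(source_field["title"], lang)
        fields.append(field)

    names = [field["name"] for field in fields]
    resource = {
        # Hypothetical slug rule: the name is only implied by the page name.
        "name": re.sub(r"[^a-z0-9._-]+", "-", page_name.lower()).strip("-"),
        # Rows become objects keyed by field name; null cells stay null.
        "data": [dict(zip(names, row)) for row in page["data"]],
        "schema": {"fields": fields},
        "licenses": [{"name": page["license"]}],
    }
    if "description" in page:
        resource["description"] = pick_language(page["description"], lang)
    if "sources" in page:
        # Wiki markup is kept verbatim as the title of a single source.
        resource["sources"] = [{"title": page["sources"]}]
    return resource
```

A page with the fields `city` (`string`) and `count` (`number`) and the rows `[["a", 1], ["b", null]]` would yield inline data of the form `[{"city": "a", "count": 1}, {"city": "b", "count": null}]`.
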
content/docs/recipes/relationship-between-fields.md

Lines changed: 43 additions & 39 deletions
@@ -5,13 +5,13 @@ title: Relationship between Fields
55
<table>
66
<tr>
77
<th>Authors</th>
8-
<td>Philippe Thomy, Peter Desmet</td>
8+
<td>Philippe Thomy</td>
99
</tr>
1010
</table>
1111

12-
The structure of tabular datasets is simple: a set of Fields grouped in a table.
12+
The structure of tabular datasets is simple: a set of fields grouped in a table.
1313

14-
However, the data present is often complex and reflects an interdependence between Fields (see explanations in the Internet-Draft [NTV tabular format (NTV-TAB)](https://www.ietf.org/archive/id/draft-thomy-ntv-tab-00.html#section-2)).
14+
However, the data present is often complex and reflects an interdependence between fields (see explanations in the Internet-Draft [NTV tabular format (NTV-TAB)](https://www.ietf.org/archive/id/draft-thomy-ntv-tab-00.html#section-2)).
1515

1616
Let's take the example of the following dataset:
1717

@@ -22,15 +22,15 @@ Let's take the example of the following dataset:
2222
| Estonia | European Union | ES | 449 |
2323
| Nigeria | Africa | NI | 1460 |
2424

25-
The data schema for this dataset indicates in the Field Descriptor "description":
25+
The data schema for this dataset has the following `description`:
2626

27-
- for the "code" Field : "country code alpha-2"
28-
- for the "population" Field: "region population in 2022 (millions)"
27+
- for the `code` field : "country code alpha-2"
28+
- for the `population` field: "region population in 2022 (millions)"
2929

3030
If we now look at the data we see that this dataset is not consistent because it contains two structural errors:
3131

32-
- The value of the "code" Field must be unique for each country, we cannot therefore have "ES" for "Spain" and "Estonia",
33-
- The value of the "population" Field of "European Union" cannot have two different values (449 and 48)
32+
- The value of the `code` field must be unique for each country, we cannot therefore have "ES" for "Spain" and "Estonia",
33+
- The value of the `population` field of "European Union" cannot have two different values (449 and 48)
3434

3535
These structural errors make the data unusable and yet they are not detected in the validation of the dataset (in the current version of Table Schema, there are no Descriptors to express this dependency between two fields).
3636

@@ -70,92 +70,96 @@ Two aspects need to be addressed:
7070

7171
A relationship is defined by the following information:
7272

73-
- the two Fields involved (the order of the Fields is important with the "derived" link),
73+
- the two fields involved (the order of the fields is important with the `derived` link),
7474
- the textual representation of the relationship,
7575
- the nature of the relationship
7676

7777
Three proposals for extending Table Schema are being considered:
7878

79-
- New Field Descriptor
80-
- New Constraint Property
81-
- New Table Descriptor
79+
- New field descriptor
80+
- New constraint property
81+
- New table descriptor
8282

83-
After discussions only the third is retained (a relationship between fields associated to a Field) and presented below:
83+
After discussions only the third is retained (a relationship between fields associated to a field) and presented below:
8484

85-
- **New Table Descriptor**:
85+
- **New table descriptor**:
8686

87-
A `relationships` Table Descriptor is added.
88-
The properties associated with this Descriptor could be:
87+
A `relationships` table descriptor is added.
88+
The properties associated with this descriptor could be:
8989

90-
- `fields`: array with the names of the two Fields involved
90+
- `fields`: array with the names of the two fields involved
9191
- `description`: description string (optional)
9292
- `link`: nature of the relationship
9393

9494
Pros:
9595

96-
- No mixing with Fields descriptors
96+
- No mixing with field descriptors
9797

9898
Cons:
9999

100-
- Need to add a new Table Descriptor
101-
- The order of the Fields in the array is important with the "derived" link
100+
- Need to add a new table descriptor
101+
- The order of the fields in the array is important with the `derived` link
102102

103103
Example:
104104

105105
```json
106-
{ "fields": [ ],
106+
{
107+
"fields": [ ],
107108
"relationships": [
108-
{ "fields" : [ "country", "code"],
109+
{
110+
"fields" : ["country", "code"],
109111
"description" : "is the country code alpha-2 of",
110112
"link" : "coupled"
111113
},
112-
{ "fields" : [ "region", "population"],
114+
{
115+
"fields" : ["region", "population"],
113116
"description" : "is the population of",
114-
"link" : "derived"}
117+
"link" : "derived"
118+
}
115119
]
116120
}
117121
```
118122

119123
## Specification
120124

121-
Assuming solution 3 (Table Descriptor), the specification could be as follows:
125+
Assuming solution 3 (table descriptor), the specification could be as follows:
122126

123-
The `relationships` Descriptor MAY be used to define the dependency between fields.
127+
The `relationships` descriptor MAY be used to define the dependency between fields.
124128

125-
The `relationships` Descriptor, if present, MUST be an array where each entry in the array is an object and MUST contain two required properties and one optional:
129+
The `relationships` descriptor, if present, MUST be an array where each entry is an object that MUST contain two required properties and MAY contain one optional property:
126130

127131
- `fields`: Array with the property `name` of the two fields linked (required)
128132
- `link` : String with the nature of the relationship between them (required)
129-
- `description` : String with the description of the relationship between the two Fields (optional)
133+
- `description` : String with the description of the relationship between the two fields (optional)
130134

131135
The `link` property value MUST be one of the following three:
132136

133-
- `derived` :
137+
- `derived`:
134138

135139
- The values of the child (second array element) field are dependent on the values of the parent (first array element) field (i.e. a value in the parent field is associated with a single value in the child field).
136-
- e.g. The "name" field [ "john", "paul", "leah", "paul" ] and the "Nickname" field [ "jock", "paulo", "lili", "paulo" ] are derived,
137-
- i.e. if a new entry "leah" is added, the corresponding "nickname" value must be "lili".
140+
- e.g. The `name` field ["john", "paul", "leah", "paul"] and the `nickname` field ["jock", "paulo", "lili", "paulo"] are derived,
141+
- i.e. if a new entry "leah" is added, the corresponding `nickname` value must be "lili".
138142

139-
- `coupled` :
143+
- `coupled`:
140144

141145
- The values of one field are associated to the values of the other field.
142-
- e.g. The "Country" field [ "france", "spain", "estonia", "spain" ] and the "code alpha-2" field [ "FR", "ES", "EE", "ES" ] are coupled,
143-
- i.e. if a new entry "estonia" is added, the corresponding "code alpha-2" value must be "EE" just as if a new entry "EE" is added, the corresponding "Country" value must be "estonia".
146+
- e.g. The `Country` field ["france", "spain", "estonia", "spain"] and the `code alpha-2` field ["FR", "ES", "EE", "ES"] are coupled,
147+
- i.e. if a new entry "estonia" is added, the corresponding `code alpha-2` value must be "EE" just as if a new entry "EE" is added, the corresponding `Country` value must be "estonia".
144148

145-
- `crossed` :
149+
- `crossed`:
146150

147151
- This relationship means that all the different values of one field are associated with all the different values of the other field.
148-
- e.g. the "Year" Field [ 2020, 2020, 2021, 2021] and the "Population" Field [ "estonia", "spain", "estonia", "spain" ] are crossed
152+
- e.g. the `Year` field [2020, 2020, 2021, 2021] and the `Population` field ["estonia", "spain", "estonia", "spain"] are crossed
149153
- i.e. the year 2020 is associated with the populations of "spain" and "estonia", just as the population of "estonia" is associated with the years 2020 and 2021
150154

151155
## Implementations
152156

153-
The implementation of a new Descriptor is not discussed here (no particular point to address).
157+
The implementation of a new descriptor is not discussed here (no particular point to address).
154158

155159
The control implementation is based on the following principles:
156160

157-
- calculation of the number of different values for the two Fields,
158-
- calculation of the number of different values for the virtual Field composed of tuples of each of the values of the two Fields
161+
- calculation of the number of different values for the two fields,
162+
- calculation of the number of different values for the virtual field composed of tuples of each of the values of the two fields
159163
- comparison of these three values to deduce the type of relationship
160164
- comparison of the calculated relationship type with that defined in the data schema
161165

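
The counting principles above can be sketched in a few lines of Python. The recipe does not spell out the exact comparison rules, so the ones used here are an assumption derived from the definitions of `coupled`, `derived`, and `crossed` in the Specification section; the function name `relationship_type` is hypothetical.

```python
def relationship_type(parent, child):
    """Deduce the relationship between two fields from three distinct-value counts.

    `parent` and `child` are the column values of the first and second field
    listed in a `relationships` entry. Returns "coupled", "derived", "crossed",
    or None when none of the three applies.
    """
    if len(parent) != len(child):
        raise ValueError("both fields must have the same number of values")

    n_parent = len(set(parent))            # distinct values of the first field
    n_child = len(set(child))              # distinct values of the second field
    n_pairs = len(set(zip(parent, child))) # distinct values of the virtual tuple field

    # Assumed rules, checked from the strictest to the loosest:
    if n_pairs == n_parent == n_child:
        return "coupled"   # one-to-one in both directions
    if n_pairs == n_parent:
        return "derived"   # each parent value has a single child value
    if n_pairs == n_parent * n_child:
        return "crossed"   # every combination of values occurs
    return None


# The `coupled` and `crossed` examples from the Specification section:
print(relationship_type(["france", "spain", "estonia", "spain"], ["FR", "ES", "EE", "ES"]))
print(relationship_type([2020, 2020, 2021, 2021], ["estonia", "spain", "estonia", "spain"]))
```

A validator would then compare the returned value with the `link` declared in the schema. With the inconsistent dataset from the introduction, `country` and `code` no longer come out as `coupled` because "ES" is shared by "Spain" and "Estonia", so a declared `coupled` link would be reported as violated. Because a coupled pair also satisfies the `derived` rule, an implementation may choose to accept `coupled` data where `derived` is declared.
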