Commit e4d1467

Merge pull request #413 from tdwg/faq-parquet
added FAQ entry about using Parquet files for data
2 parents 82c707e + e3bdb2e commit e4d1467

1 file changed

+98
-45
lines changed

pages/faq.md

Lines changed: 98 additions & 45 deletions
@@ -5,6 +5,10 @@ permalink: /faq/
55
toc: true
66
---
77

8+
<!-- References -->
9+
[camtrapdp]: https://inbo.github.io/camtrapdp/
10+
[frictionless-py]: https://framework.frictionlessdata.io/
11+
812
{:id="bboxes"}
913
## How to describe bounding boxes of detected objects?
1014

@@ -36,7 +40,7 @@ There are two ways to include additional information (values not covered by the
3640

3741
### Using tags
3842

39-
Deployment and observation tables include [`deploymentTags`](/data/#deployments.deploymentTags) and [`observationTags`](/data/#observations.observationTags) fields. You can use these fields to store additional information as key:value pairs, separated by a pipe character (|). For example, this is how temperature and snow cover information could be represented in the deployment table:
43+
Deployment and observation tables include [`deploymentTags`](/data/#deployments.deploymentTags) and [`observationTags`](/data/#observations.observationTags) fields. You can use these fields to store additional information as key:value pairs, separated by a pipe character (`|`). For example, this is how temperature and snow cover information could be represented in the deployment table:
4044

4145
deploymentID | deploymentTags
4246
--- | ---
@@ -51,50 +55,50 @@ You can add a custom table to the data package to store additional information.
5155

5256
```json
5357
{
54-
"name": "deployment-measurements",
55-
"title": "Deployment measurements",
56-
"description": "Table with weather measurements for deployments. Associated with deployments (`deploymentID`).",
57-
"fields": [
58-
{
59-
"name": "deploymentID",
60-
"description": "Identifier of the deployment. Foreign key to `deployments.deploymentID`.",
61-
"skos:broadMatch": "http://rs.tdwg.org/dwc/terms/parentEventID",
62-
"type": "string",
63-
"constraints": {
64-
"required": true
65-
},
66-
"example": "dep1"
67-
},
68-
{
69-
"name": "temperature",
70-
"description": "Temperature (in Celsius) at the time of the observation.)",
71-
"type": "number",
72-
"constraints": {
73-
"required": false,
74-
"minimum": -50,
75-
"maximum": 100
76-
},
77-
"example": 19.5
78-
},
79-
{
80-
"name": "snowCover",
81-
"description": "Snow cover present at the time of the observation.",
82-
"type": "boolean",
83-
"constraints": {
84-
"required": false
85-
},
86-
"example": true
87-
}
88-
],
89-
"foreignKeys": [
90-
{
91-
"fields": "deploymentID",
92-
"reference": {
93-
"resource": "deployments",
94-
"fields": "deploymentID"
95-
}
96-
}
97-
]
58+
"name": "deployment-measurements",
59+
"title": "Deployment measurements",
60+
"description": "Table with weather measurements for deployments. Associated with deployments (`deploymentID`).",
61+
"fields": [
62+
{
63+
"name": "deploymentID",
64+
"description": "Identifier of the deployment. Foreign key to `deployments.deploymentID`.",
65+
"skos:broadMatch": "http://rs.tdwg.org/dwc/terms/parentEventID",
66+
"type": "string",
67+
"constraints": {
68+
"required": true
69+
},
70+
"example": "dep1"
71+
},
72+
{
73+
"name": "temperature",
74+
"description": "Temperature (in Celsius) at the time of the observation.",
75+
"type": "number",
76+
"constraints": {
77+
"required": false,
78+
"minimum": -50,
79+
"maximum": 100
80+
},
81+
"example": 19.5
82+
},
83+
{
84+
"name": "snowCover",
85+
"description": "Snow cover present at the time of the observation.",
86+
"type": "boolean",
87+
"constraints": {
88+
"required": false
89+
},
90+
"example": true
91+
}
92+
],
93+
"foreignKeys": [
94+
{
95+
"fields": "deploymentID",
96+
"reference": {
97+
"resource": "deployments",
98+
"fields": "deploymentID"
99+
}
100+
}
101+
]
98102
}
99103
```
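
The foreign key in the descriptor above implies that every `deploymentID` in the custom table must also exist in the deployments table. A minimal sketch of that check in Python, assuming both tables are already loaded as lists of dictionaries (the function name `missing_foreign_keys` and the sample rows are hypothetical, not part of Camtrap DP or any library):

```python
def missing_foreign_keys(child_rows, parent_rows, field="deploymentID"):
    """Return the sorted child values of `field` that have no match in the parent table."""
    parent_keys = {row[field] for row in parent_rows}
    return sorted({row[field] for row in child_rows if row[field] not in parent_keys})


# Illustrative rows only; real data would come from the package's tables.
deployments = [{"deploymentID": "dep1"}, {"deploymentID": "dep2"}]
measurements = [
    {"deploymentID": "dep1", "temperature": 19.5, "snowCover": True},
    {"deploymentID": "dep9", "temperature": -3.0, "snowCover": False},
]

orphans = missing_foreign_keys(measurements, deployments)
```

An empty result means the custom table satisfies the foreign key. A validator such as frictionless-py performs this kind of check as part of validating the package.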
100104

@@ -120,6 +124,55 @@ We provide an [R package](https://inbo.github.io/camtrapdp/) to read and manipul
120124

121125
Consult the merge function documentation to understand exactly how specific fields are merged to avoid information loss. Please note that when merging data packages x and y, the [`project$samplingDesign`](/metadata/#project.samplingDesign) field in the resulting package will be set to the value of `project$samplingDesign` from data package x. Therefore, we recommend merging data packages only for projects that use the same sampling design.
122126

127+
{:id="large-tables"}
128+
## Do I need to use CSV files?
129+
130+
No. Some studies have media and observations tables with over a million records, which may be hard to produce or consume as CSV files. Here are two approaches for formatting large files:
131+
132+
### Gzipped CSV files
133+
134+
Compressing a CSV file can often reduce its size substantially. We recommend gzip over zip, as it allows the file to be read directly. Compressed CSV files are supported in all versions of Camtrap DP, by [frictionless-py][frictionless-py] and by the [camtrapdp][camtrapdp] R package.
135+
136+
1. Compress the file:
137+
138+
```shell
139+
gzip media.csv
140+
```
141+
142+
2. Refer to the compressed CSV file in the `datapackage.json` as follows:
143+
144+
```json
145+
{
146+
"name": "media",
147+
"path": "media.csv.gz",
148+
"profile": "tabular-data-resource",
149+
"format": "csv",
150+
"mediatype": "text/csv",
151+
"encoding": "UTF-8",
152+
"schema": "https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.2/media-table-schema.json"
153+
}
154+
```
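
To illustrate what reading a gzipped CSV directly looks like, here is a minimal sketch using only the Python standard library. The helper names and the two sample rows are hypothetical, and the column names are used purely for illustration; the sketch writes a small `media.csv.gz` to a temporary directory and reads it back without decompressing it to disk first.

```python
import csv
import gzip
import os
import tempfile


def write_gzipped_csv(path, header, rows):
    """Write rows to a gzip-compressed CSV file in text mode."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)


def read_gzipped_csv(path):
    """Read a gzip-compressed CSV file into a list of dictionaries."""
    with gzip.open(path, "rt", encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))


# Illustrative data only, written to a temporary directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "media.csv.gz")
write_gzipped_csv(path, ["mediaID", "deploymentID"], [["m1", "dep1"], ["m2", "dep1"]])

records = read_gzipped_csv(path)
```

For tables with millions of records, iterating over `csv.DictReader` row by row instead of building a list keeps memory use low.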
155+
156+
### Apache Parquet
157+
158+
[Apache Parquet](https://parquet.apache.org/) is an open source data file format designed for efficient data storage and retrieval. Parquet files are supported in Camtrap DP 1.0.2 and by [frictionless-py][frictionless-py] after installing an [extension](https://framework.frictionlessdata.io/docs/formats/parquet.html), but **not by the [camtrapdp][camtrapdp] R package**, because [its dependency](https://github.com/frictionlessdata/frictionless-r/issues/117) does not yet support them.
159+
160+
1. Create the Parquet file (e.g. with the arrow R package).
161+
162+
2. Refer to the Parquet file in the `datapackage.json` as follows:
163+
164+
```json
165+
{
166+
"name": "media",
167+
"path": "media.parquet",
168+
"profile": "tabular-data-resource",
169+
"format": "parquet",
170+
"mediatype": "application/vnd.apache.parquet",
171+
"encoding": "UTF-8",
172+
"schema": "https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.2/media-table-schema.json"
173+
}
174+
```
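
The two resource descriptors above differ only in `path`, `format` and `mediatype`. As a rough sketch, a small Python helper could generate either one from the file name. The function `build_resource` and the `FORMATS` mapping are hypothetical and not part of Camtrap DP or frictionless-py; the field values are copied from the examples above.

```python
import json
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to (format, mediatype),
# using the values shown in the descriptors above.
FORMATS = {
    ".csv": ("csv", "text/csv"),
    ".parquet": ("parquet", "application/vnd.apache.parquet"),
}


def build_resource(name, path, schema_url):
    """Build a tabular data resource descriptor for a CSV, gzipped CSV or Parquet file."""
    suffixes = PurePosixPath(path).suffixes
    # A trailing .gz marks compression only; the format comes from the suffix before it.
    if suffixes and suffixes[-1] == ".gz":
        suffixes = suffixes[:-1]
    if not suffixes or suffixes[-1] not in FORMATS:
        raise ValueError(f"Unsupported file type: {path}")
    fmt, mediatype = FORMATS[suffixes[-1]]
    return {
        "name": name,
        "path": path,
        "profile": "tabular-data-resource",
        "format": fmt,
        "mediatype": mediatype,
        "encoding": "UTF-8",
        "schema": schema_url,
    }


SCHEMA = "https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.2/media-table-schema.json"
gz_resource = build_resource("media", "media.csv.gz", SCHEMA)
parquet_resource = build_resource("media", "media.parquet", SCHEMA)
print(json.dumps(parquet_resource, indent=2))
```

The resulting dictionaries can be placed in the `resources` array of `datapackage.json`.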
175+
123176
{:id="ask"}
124177
## Have a question?
125178
