Commit e3bdb2e

Suggest compressed csv + parquet approach
1 parent d07d9b6 commit e3bdb2e

File tree

1 file changed: +49 −36 lines

pages/faq.md

Lines changed: 49 additions & 36 deletions
@@ -5,6 +5,10 @@ permalink: /faq/
toc: true
---

<!-- References -->
[camtrapdp]: https://inbo.github.io/camtrapdp/
[frictionless-py]: https://framework.frictionlessdata.io/

{:id="bboxes"}
## How to describe bounding boxes of detected objects?

@@ -120,45 +124,54 @@ We provide an [R package](https://inbo.github.io/camtrapdp/) to read and manipul

Consult the merge function documentation to understand exactly how specific fields are merged, so you can avoid information loss. Please note that when merging data packages x and y, the [`project$samplingDesign`](/metadata/#project.samplingDesign) field in the resulting package will be set to the value of `project$samplingDesign` from data package x. We therefore recommend merging data packages only for projects that use the same sampling design.

{:id="large-tables"}
## Do I need to use CSV files?

No. Some studies have media and observations tables with over a million records, which can be hard to produce or consume as plain CSV files. Here are two approaches for formatting such large files:

### Gzipped CSV files

By compressing a CSV file, you can often reduce its size substantially. We recommend gzip over zip, as it allows the compressed file to be read directly, without first extracting an archive. Compressed CSV files are supported in all versions of Camtrap DP, by [frictionless-py][frictionless-py] and by the [camtrapdp][camtrapdp] R package.

1. Compress the file:

   ```sh
   gzip media.csv
   ```

2. Refer to the compressed CSV file in the `datapackage.json` as follows:

   ```json
   {
     "name": "media",
     "path": "media.csv.gz",
     "profile": "tabular-data-resource",
     "format": "csv",
     "mediatype": "text/csv",
     "encoding": "UTF-8",
     "schema": "https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.2/media-table-schema.json"
   }
   ```
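As an illustration of the direct-read advantage, here is a minimal Python sketch using only the standard library. The file name and column names are hypothetical examples, not prescribed by Camtrap DP:

```python
import csv
import gzip

# Write a small example media table, compressing it in one step
rows = [
    {"mediaID": "m1", "filePath": "img_0001.jpg"},
    {"mediaID": "m2", "filePath": "img_0002.jpg"},
]
with gzip.open("media.csv.gz", "wt", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["mediaID", "filePath"])
    writer.writeheader()
    writer.writerows(rows)

# Read the gzipped CSV directly, with no intermediate uncompressed file
with gzip.open("media.csv.gz", "rt", newline="", encoding="utf-8") as f:
    records = list(csv.DictReader(f))
```

Because gzip compresses a single stream (rather than bundling files in an archive, as zip does), any CSV reader can be pointed at the decompressing stream directly.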

### Apache Parquet

[Apache Parquet](https://parquet.apache.org/) is an open-source data file format designed for efficient data storage and retrieval; `application/vnd.apache.parquet` is a [registered media type](https://www.iana.org/assignments/media-types/application/vnd.apache.parquet). Parquet files are supported as of Camtrap DP [1.0.2](https://github.com/tdwg/camtrap-dp/releases/tag/1.0.2) and by [frictionless-py][frictionless-py] after installing an [extension](https://framework.frictionlessdata.io/docs/formats/parquet.html), but **not by the [camtrapdp][camtrapdp] R package** (as the format is not yet supported by [its dependency](https://github.com/frictionlessdata/frictionless-r/issues/117)).

1. Create the Parquet file (e.g. with the arrow R package).

2. Refer to the Parquet file in the `datapackage.json` as follows:
   ```json
   {
     "name": "media",
     "path": "media.parquet",
     "profile": "tabular-data-resource",
     "format": "parquet",
     "mediatype": "application/vnd.apache.parquet",
     "encoding": "UTF-8",
     "schema": "https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.2/media-table-schema.json"
   }
   ```

{:id="ask"}
## Have a question?
