Merge pull request #413 from tdwg/faq-parquet

peterdesmet · web-flow · commit e4d14670acdc · 2025-06-26T13:48:20.000+02:00
added FAQ entry about using Parquet files for data
diff --git a/pages/faq.md b/pages/faq.md
@@ -5,6 +5,10 @@ permalink: /faq/
 toc: true
 ---
 
+<!-- References -->
+[camtrapdp]: https://inbo.github.io/camtrapdp/
+[frictionless-py]: https://framework.frictionlessdata.io/
+
 {:id="bboxes"}
 ## How to describe bounding boxes of detected objects?
 
@@ -36,7 +40,7 @@ There are two ways to include additional information (values not covered by the
 
 ### Using tags
 
-Deployment and observation tables include [`deploymentTags`](/data/#deployments.deploymentTags) and [`observationTags`](/data/#observations.observationTags) fields. You can use these fields to store additional information as key:value pairs, separated by a pipe character (|). For example, this is how temperature and snow cover information could be represented in the deployment table:
+Deployment and observation tables include [`deploymentTags`](/data/#deployments.deploymentTags) and [`observationTags`](/data/#observations.observationTags) fields. You can use these fields to store additional information as key:value pairs, separated by a pipe character (`|`). For example, this is how temperature and snow cover information could be represented in the deployment table:
 
 deploymentID | deploymentTags
 --- | ---
@@ -51,50 +55,50 @@ You can add a custom table to the data package to store additional information.
 
 ```json
 {
-    "name": "deployment-measurements",
-    "title": "Deployment measurements",
-    "description": "Table with weather measurements for deployments. Associated with deployments (`deploymentID`).",
-    "fields": [
-        {
-            "name": "deploymentID",
-            "description": "Identifier of the deployment. Foreign key to `deployments.deploymentID`.",
-            "skos:broadMatch": "http://rs.tdwg.org/dwc/terms/parentEventID",
-            "type": "string",
-            "constraints": {
-                "required": true
-            },
-            "example": "dep1"
-        },
-        {
-            "name": "temperature",
-            "description": "Temperature (in Celsius) at the time of the observation.)",
-            "type": "number",
-            "constraints": {
-                "required": false,
-                "minimum": -50,
-                "maximum": 100
-            },
-            "example": 19.5
-        },
-        {
-            "name": "snowCover",
-            "description": "Snow cover present at the time of the observation.",
-            "type": "boolean",
-            "constraints": {
-                "required": false
-            },
-            "example": true
-        }
-    ],
-    "foreignKeys": [
-        {
-            "fields": "deploymentID",
-            "reference": {
-                "resource": "deployments",
-                "fields": "deploymentID"
-            }
-        }
-    ]
+  "name": "deployment-measurements",
+  "title": "Deployment measurements",
+  "description": "Table with weather measurements for deployments. Associated with deployments (`deploymentID`).",
+  "fields": [
+    {
+      "name": "deploymentID",
+      "description": "Identifier of the deployment. Foreign key to `deployments.deploymentID`.",
+      "skos:broadMatch": "http://rs.tdwg.org/dwc/terms/parentEventID",
+      "type": "string",
+      "constraints": {
+        "required": true
+      },
+      "example": "dep1"
+    },
+    {
+      "name": "temperature",
+      "description": "Temperature (in Celsius) at the time of the observation.",
+      "type": "number",
+      "constraints": {
+        "required": false,
+        "minimum": -50,
+        "maximum": 100
+      },
+      "example": 19.5
+    },
+    {
+      "name": "snowCover",
+      "description": "Snow cover present at the time of the observation.",
+      "type": "boolean",
+      "constraints": {
+        "required": false
+      },
+      "example": true
+    }
+  ],
+  "foreignKeys": [
+    {
+      "fields": "deploymentID",
+      "reference": {
+        "resource": "deployments",
+        "fields": "deploymentID"
+      }
+    }
+  ]
 }
 ```
 
@@ -120,6 +124,55 @@ We provide an [R package](https://inbo.github.io/camtrapdp/) to read and manipul
 
 Consult the merge function documentation to understand exactly how specific fields are merged to avoid information loss. Please note that when merging data packages x and y, the [`project$samplingDesign`](/metadata/#project.samplingDesign) field in the resulting package will be set to the value of `project$samplingDesign` from data package x. Therefore, we recommend merging data packages only for projects that use the same sampling design.
 
+{:id="large-tables"}
+## Do I need to use CSV files?
+
+No. Some studies have media and observations tables with over a million records, which can be hard to produce or consume as uncompressed CSV files. Here are two approaches for handling such large tables:
+
+### Gzipped CSV files
+
+Compressing a CSV file often reduces its size substantially. We recommend gzip over zip, as software can read a gzipped file directly without extracting it first. Compressed CSV files are supported in all versions of Camtrap DP, by [frictionless-py][frictionless-py] and by the [camtrapdp][camtrapdp] R package.
+
+1. Compress the file:
+
+    ```shell
+    gzip media.csv
+    ```
+
+2. Refer to the compressed CSV file in the `datapackage.json` as follows:
+
+    ```json
+    {
+      "name": "media",
+      "path": "media.csv.gz",
+      "profile": "tabular-data-resource",
+      "format": "csv",
+      "mediatype": "text/csv",
+      "encoding": "UTF-8",
+      "schema": "https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.2/media-table-schema.json"
+    }
+    ```
+
+### Apache Parquet
+
+[Apache Parquet](https://parquet.apache.org/) is an open source data file format designed for efficient data storage and retrieval. Parquet files are supported in Camtrap DP 1.0.2 and by [frictionless-py][frictionless-py] after installing an [extension](https://framework.frictionlessdata.io/docs/formats/parquet.html), but **not by the [camtrapdp][camtrapdp] R package** (as Parquet is not yet supported by [its dependency](https://github.com/frictionlessdata/frictionless-r/issues/117)).
+
+1. Create the Parquet file (e.g. with the arrow R package).
+
+2. Refer to the Parquet file in the `datapackage.json` as follows:
+
+    ```json
+    {
+      "name": "media",
+      "path": "media.parquet",
+      "profile": "tabular-data-resource",
+      "format": "parquet",
+      "mediatype": "application/vnd.apache.parquet",
+      "encoding": "UTF-8",
+      "schema": "https://raw.githubusercontent.com/tdwg/camtrap-dp/1.0.2/media-table-schema.json"
+    }
+    ```
+
 {:id="ask"}
 ## Have a question?