Custom dataflows processors and goodtables checks for BCO-DMO.
To run the `dpp` command locally using the custom processors located in this repository, clone this repository and set the environment variable `DPP_PROCESSOR_PATH` to `$REPO_PATH/bcodmo_frictionless/bcodmo_pipeline_processors`:

```
export DPP_PROCESSOR_PATH=/path/to/bcodmo_frictionless/bcodmo_frictionless/bcodmo_pipeline_processors
```

See https://github.com/frictionlessdata/datapackage-pipelines for documentation on standard processors.
Loads data into the package.
Parameters:
- `from` - source URL(s) or path(s); can be comma-separated for multiple sources
- `name` - resource name(s), required (comma-separated if multiple sources)
- `use_filename` - use the filename as the resource name instead of `name`
- `input_separator` - separator for the `from` and `name` parameters (default: `,`)
- `input_path_pattern` - treat `from` as a glob pattern to match multiple files
- `remove_empty_rows` - remove rows where all values are empty (default: `true`)
- `missing_values` - list of values to interpret as missing data (default: `['']`)
- `sheet` - sheet name/number for Excel files
- `sheet_regex` - treat `sheet` as a regex pattern to match multiple sheets
- `sheet_separator` - separator for multiple sheet names in `sheet`
- `format` - file format (supports `bcodmo-fixedwidth`, `bcodmo-regex-csv`)
- `recursion_limit` - override Python's recursion limit
Fixed-width format parameters (when `format` is `bcodmo-fixedwidth`):
- `width` - column width
- `infer` - infer width automatically
- `parse_seabird_header` - parse a `.cnv` seabird file header
- `seabird_capture_skipped_rows` - list of `{column_name, regex}` to capture data from skipped rows
- `seabird_capture_skipped_rows_join` - join multiple matches (default: `true`)
- `seabird_capture_skipped_rows_join_string` - join string (default: `;`)
- `fixedwidth_sample_size` - rows to sample for width inference
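For illustration, a load step in a pipeline-spec.yaml might look like the sketch below; the file paths and resource names are made up, and the step assumes the processor is registered as `load` via `DPP_PROCESSOR_PATH`:

```yaml
# Hypothetical step: load two CSVs as two resources
- run: load
  parameters:
    from: "data/ctd_1.csv,data/ctd_2.csv"   # comma-separated sources
    name: "ctd_1,ctd_2"                     # one resource name per source
    remove_empty_rows: true
    missing_values: ["", "n/a"]
```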
Concatenates multiple resources into a single resource.
Parameters:
- `sources` - list of resource names to concatenate
- `target` - target resource configuration
  - `name` - name of the concatenated resource (default: `concat`)
  - `path` - output path
- `fields` - mapping of target field names to source field names
- `include_source_names` - list of source identifiers to add as columns
  - `type` - one of `resource`, `path`, or `file`
  - `column_name` - name of the new column
- `missing_values` - list of missing value indicators
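A sketch of a concatenate step, continuing the hypothetical resources from the load example above (the shape of the `fields` mapping follows the standard dataflows convention and is an assumption, not copied from real pipeline code):

```yaml
# Hypothetical step: merge two resources into one
- run: concatenate
  parameters:
    sources: ["ctd_1", "ctd_2"]
    target:
      name: ctd_all
      path: data/ctd_all.csv
    fields:
      temperature: ["temp", "Temperature"]  # target field <- source spellings
```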
Dumps data to a local filesystem path.
Parameters:
- `out-path` - output directory (default: `.`)
- `save_pipeline_spec` - save the pipeline-spec.yaml file
- `pipeline_spec` - pipeline spec content to save
- `data_manager` - object with `name` and `orcid` keys for the data manager
Notes:
- Attempts to set file permissions to 775
- Removes carriage return (`\r`) characters from line endings
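A minimal dump_to_path sketch (the output directory and data manager values are placeholders):

```yaml
# Hypothetical step: write the processed package to ./output
- run: dump_to_path
  parameters:
    out-path: ./output
    save_pipeline_spec: true
    data_manager:
      name: Jane Doe
      orcid: 0000-0000-0000-0000
```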
Dumps data to an S3-compatible storage.
Parameters:
- `bucket_name` - S3 bucket name
- `prefix` - path prefix within the bucket
- `format` - output format (default: `csv`)
- `save_pipeline_spec` - save the pipeline-spec.yaml file
- `pipeline_spec` - pipeline spec content to save
- `data_manager` - object with `name` and `orcid` keys
- `use_titles` - use field titles instead of names in output
- `temporal_format_property` - property name to use for temporal field formats
- `delete` - delete existing files at prefix before dumping
- `limit_yield` - limit number of rows yielded downstream
- `dump_unique_lat_lon` - create a separate file with unique lat/lon pairs
Environment variables:
- `AWS_ACCESS_KEY_ID`
- `AWS_SECRET_ACCESS_KEY`
- `LAMINAR_S3_HOST` - S3 endpoint URL
Renames fields.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of field mappings
  - `old_field` - current field name
  - `new_field` - new field name
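For example, renaming a single field might look like this (resource and field names are illustrative):

```yaml
# Hypothetical step: rename temp -> temperature_c
- run: rename_fields
  parameters:
    resources: [ctd_all]
    fields:
      - old_field: temp
        new_field: temperature_c
```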
Renames fields using regular expressions.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of field names to rename
- `pattern` - regex pattern
  - `find` - regex pattern to match
  - `replace` - replacement pattern
Reorders fields in a resource.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of field names in the desired order (must include all fields)
Updates field metadata in the schema.
Parameters:
- `resources` - list of resources to operate on
- `fields` - object mapping field names to metadata properties to update
Sets field types and options.
Parameters:
- `resources` - list of resources to operate on
- `types` - object mapping field names to type options
- `regex` - treat field names as regex patterns (default: `true`)
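A sketch of a set_types step; the field names are made up, and the shape of each type-options object is assumed to follow Table Schema conventions:

```yaml
# Hypothetical step: set types, matching field names by regex
- run: set_types
  parameters:
    resources: [ctd_all]
    regex: true
    types:
      "temperature.*":   # matches temperature_c, temperature_f, ...
        type: number
      station:
        type: string
```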
Adds metadata to the resource schema.
Parameters:
- `resources` - list of resources to operate on
- Any additional parameters are added as schema metadata
Adds computed fields using boolean conditions.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of new fields to create
  - `target` - name of the new field
  - `type` - data type of the new field
  - `functions` - list of conditions and values
    - `boolean` - boolean expression (see Boolean Syntax below)
    - `value` - value to set when the condition is true (supports `{field_name}` substitution)
    - `math_operation` - if `true`, evaluate `value` as a math expression
    - `always_run` - if `true`, skip the boolean check
Notes:
- Functions are evaluated in order; later matches override earlier ones
- Supports datetime, date, and time output types
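A sketch combining the parameters above (resource and field names are hypothetical):

```yaml
# Hypothetical step: flag rows as deep or shallow
- run: boolean_add_computed_field
  parameters:
    resources: [ctd_all]
    fields:
      - target: depth_flag
        type: string
        functions:
          - boolean: "{depth} > 200"
            value: deep
          - boolean: "{depth} <= 200"
            value: shallow
```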
Filters rows based on a boolean condition.
Parameters:
- `resources` - list of resources to operate on
- `boolean_statement` - boolean expression; only rows that pass are kept
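For example (see Boolean Syntax below for the expression grammar; names are illustrative):

```yaml
# Hypothetical step: keep only northern rows with a known depth
- run: boolean_filter_rows
  parameters:
    resources: [ctd_all]
    boolean_statement: "{lat} > 50 && {depth} != NULL"
```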
Finds and replaces text using regular expressions.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of fields to process
  - `name` - field name
  - `patterns` - list of find/replace patterns
    - `find` - regex pattern to find
    - `replace` - replacement string
    - `replace_function` - one of `string`, `uppercase`, `lowercase` (default: `string`)
    - `replace_missing_values` - apply to missing values (default: `false`)
- `boolean_statement` - optional condition for which rows to process
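A sketch of a single find/replace pattern (field names and patterns are made up):

```yaml
# Hypothetical step: normalize station labels
- run: find_replace
  parameters:
    resources: [ctd_all]
    fields:
      - name: station
        patterns:
          - find: "^ST-"
            replace: "Station "
```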
Formats strings using Python string formatting.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of format operations
  - `output_field` - name of the output field
  - `input_string` - Python format string (e.g., `{0:03d}-{1}`)
  - `input_fields` - list of field names to use as format arguments
- `boolean_statement` - optional condition for which rows to process
Notes:
- Field types matter: use `{0:03d}` for integers, `{0:03f}` for floats
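For example, building a zero-padded identifier (the integer format code only works if `cast` is typed as an integer; names are illustrative):

```yaml
# Hypothetical step: cast 7 at station A -> "007-A"
- run: string_format
  parameters:
    resources: [ctd_all]
    fields:
      - output_field: cast_id
        input_string: "{0:03d}-{1}"
        input_fields: [cast, station]
```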
Splits a field into multiple fields.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of split operations
  - `input_field` - field to split
  - `output_fields` - list of output field names
  - `pattern` - regex with capture groups (mutually exclusive with `delimiter`)
  - `delimiter` - regex delimiter to split on (mutually exclusive with `pattern`)
  - `preserve_metadata` - copy bcodmo metadata from the input field
  - `delete_input` - delete the input field after splitting
- `boolean_statement` - optional condition for which rows to process
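A sketch using the `delimiter` variant (the processor's registered name and the field names here are assumptions for illustration):

```yaml
# Hypothetical step: split "lat,lon" pairs into two fields
- run: split_fields
  parameters:
    resources: [ctd_all]
    fields:
      - input_field: position
        output_fields: [lat, lon]
        delimiter: ","
        delete_input: true
```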
Edits specific cells by row number.
Parameters:
- `resources` - list of resources to operate on
- `edited` - object mapping row numbers to edits; each value is a list of `{field, value}` objects
Extracts non-numeric values from fields into new fields.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of field names to process
- `suffix` - suffix for new field names (default: `_`)
- `preserve_metadata` - copy bcodmo metadata to new fields
- `boolean_statement` - optional condition for which rows to process
Notes:
- Non-numeric values are moved to the new field, and the original is set to null
Rounds numeric fields.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of fields to round
  - `name` - field name
  - `digits` - decimal places
  - `preserve_trailing_zeros` - keep trailing zeros
  - `maximum_precision` - only round values with precision >= `digits`
  - `convert_to_integer` - convert to integer (only when `digits` is 0)
- `boolean_statement` - optional condition for which rows to process
Converts values between units.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of conversions
  - `name` - field name
  - `conversion` - conversion function: `feet_to_meter`, `fathom_to_meter`, `inch_to_cm`, `mile_to_km`
  - `preserve_field` - keep the original field
  - `new_field_name` - name for the converted field (required if `preserve_field` is true)
  - `preserve_metadata` - copy bcodmo metadata to the new field
Converts date/time fields between formats.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of conversions
  - `output_field` - name of the output field
  - `output_format` - Python datetime format string
  - `output_type` - one of `datetime`, `date`, `time`, `string` (default: `datetime`)
  - `input_type` - one of `python`, `excel`, `matlab`, `decimalDay`, `decimalYear`
  - `preserve_metadata` - copy bcodmo metadata from the input field
  - For `python` input_type:
    - `inputs` - list of input fields, each with `field` (field name) and `format` (Python datetime format)
    - `input_timezone` - input timezone
    - `input_timezone_utc_offset` - UTC offset in hours
    - `output_timezone` - output timezone
    - `output_timezone_utc_offset` - UTC offset in hours
    - `year` - override year value
  - For `excel`/`matlab` input_type:
    - `input_field` - single input field
  - For `decimalDay` input_type:
    - `input_field` - single input field
    - `year` - year value (required)
  - For `decimalYear` input_type:
    - `input_field` - single input field
    - `decimal_year_start_day` - start day (0 or 1, required)
- `boolean_statement` - optional condition for which rows to process
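A sketch of the `python` input_type, combining separate date and time fields into one ISO 8601 timestamp (field names, formats, and timezones are illustrative):

```yaml
# Hypothetical step: date + time -> ISO 8601 string in UTC
- run: convert_date
  parameters:
    resources: [ctd_all]
    fields:
      - output_field: iso_datetime
        output_type: string
        output_format: "%Y-%m-%dT%H:%M:%SZ"
        input_type: python
        inputs:
          - field: date
            format: "%m/%d/%Y"
          - field: time
            format: "%H:%M"
        input_timezone: US/Eastern
        output_timezone: UTC
```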
Converts coordinates to decimal degrees.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of conversions
  - `input_field` - input field name
  - `output_field` - output field name
  - `format` - one of `degrees-minutes-seconds` or `degrees-decimal_minutes`
  - `pattern` - regex with named groups: `degrees`, `minutes`, `seconds`, `decimal_minutes`, `directional`
  - `directional` - compass direction (`N`, `E`, `S`, `W`) if not in the pattern
  - `handle_out_of_bounds` - handle values outside normal ranges
  - `preserve_metadata` - copy bcodmo metadata from the input field
- `boolean_statement` - optional condition for which rows to process
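A sketch for degrees-minutes-seconds input (the regex and field names are made up; note the named capture groups required by `pattern`):

```yaml
# Hypothetical step: "42 21 30.5" with directional N -> ~42.3585
- run: convert_to_decimal_degrees
  parameters:
    resources: [ctd_all]
    fields:
      - input_field: lat_dms
        output_field: lat_dd
        format: degrees-minutes-seconds
        pattern: "(?P<degrees>\\d+) (?P<minutes>\\d+) (?P<seconds>[\\d.]+)"
        directional: N
```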
Removes resources from the pipeline.
Parameters:
- `resources` - list of resource names to remove
Renames a resource.
Parameters:
- `old_resource` - current resource name
- `new_resource` - new resource name
Joins two resources together.
Parameters:
- `source` - source resource configuration
  - `name` - source resource name
  - `key` - join key field(s) or key template
  - `delete` - delete source after join (default: `false`)
- `target` - target resource configuration
  - `name` - target resource name
  - `key` - join key field(s) or key template
- `fields` - object mapping target field names to source field specs
  - `name` - source field name
  - `aggregate` - aggregation function: `sum`, `avg`, `median`, `max`, `min`, `first`, `last`, `count`, `any`, `set`, `array`, `counters`
- `mode` - join mode: `inner`, `half-outer`, `full-outer` (default: `half-outer`)
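A sketch of a join pulling a station name from a lookup resource (resource, key, and field names are illustrative):

```yaml
# Hypothetical step: attach station_name from a stations resource
- run: join
  parameters:
    source:
      name: stations
      key: [station_id]
    target:
      name: ctd_all
      key: [station_id]
    fields:
      station_name:
        name: name
    mode: half-outer
```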
These processors are from the standard dataflows library.
Duplicates a resource within the package.
Parameters:
- `source` - name of the resource to duplicate (default: first resource)
- `target-name` - name for the duplicated resource (default: `{source}_copy`)
- `target-path` - path for the duplicated resource (default: `{target-name}.csv`)
- `duplicate_to_end` - place the duplicate at the end of the package instead of after the source
Updates package-level metadata.
Parameters:
- Any key-value pairs to add/update in the package descriptor (except `resources`)
Updates resource-level metadata.
Parameters:
- `resources` - list of resources to operate on
- `metadata` - object of key-value pairs to add/update in the resource descriptor
Removes fields from resources.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of field names to delete
- `regex` - treat field names as regex patterns (default: `true`)
Sorts rows by field values.
Parameters:
- `resources` - list of resources to operate on
- `sort-by` - field name, format string (e.g., `{field1}{field2}`), or callable
- `reverse` - sort in descending order (default: `false`)
Notes:
- Numeric fields are sorted numerically
- Supports multi-field sorting via format strings
Adds computed fields using predefined operations.
Parameters:
- `resources` - list of resources to operate on
- `fields` - list of field definitions
  - `target` - name of the new field (or object with `name` and `type`)
  - `operation` - one of the operations below
  - `source` - list of source field names (for most operations)
  - `with` - additional parameter for some operations
Operations:
- `sum` - sum of source field values
- `avg` - average of source field values
- `max` - maximum of source field values
- `min` - minimum of source field values
- `multiply` - product of source field values
- `constant` - constant value (specified in `with`)
- `join` - join source values with a separator (specified in `with`)
- `format` - format string using row values (specified in `with`, e.g., `{field1}-{field2}`)
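A sketch showing one numeric and one string-formatted computed field (field names are made up):

```yaml
# Hypothetical step: add two computed fields
- run: add_computed_field
  parameters:
    resources: [ctd_all]
    fields:
      - target: temp_sum
        operation: sum
        source: [temp_surface, temp_bottom]
      - target: label
        operation: format
        with: "{station}-{cast}"
```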
Transforms columns into rows (wide to long format).
Parameters:
- `resources` - list of resources to operate on
- `unpivot` - list of field specifications to unpivot
  - `name` - field name or regex pattern
  - `keys` - object mapping extra key field names to values (can use regex backreferences)
- `extraKeyFields` - list of new key field definitions (with `name` and `type`)
- `extraValueField` - definition for the value field (with `name` and `type`)
- `regex` - treat field names as regex patterns (default: `true`)
Example:
```yaml
unpivot:
  - name: "value_\\d+"
    keys:
      year: "\\1"
extraKeyFields:
  - name: year
    type: integer
extraValueField:
  name: value
  type: number
```

Joins two resources together (standard dataflows version).
Parameters:
- `source` - source resource configuration
  - `name` - source resource name
  - `key` - join key field(s) or key template
  - `delete` - delete source after join (default: `true`)
- `target` - target resource configuration
  - `name` - target resource name
  - `key` - join key field(s) or key template
- `fields` - object mapping target field names to source field specs
  - `name` - source field name
  - `aggregate` - aggregation function: `sum`, `avg`, `median`, `max`, `min`, `first`, `last`, `count`, `any`, `set`, `array`, `counters`
- `mode` - join mode: `inner`, `half-outer`, `full-outer` (default: `half-outer`)
Notes:
- Source resource must appear before target in the package
- Use `*` in `fields` to include all source fields
Many processors support a `boolean_statement` parameter for conditional processing.
Comparison terms:
- Numbers: `50`, `3.14`, `-10`
- Strings: `'value'` (single quotes)
- Field references: `{field_name}` (curly braces)
- Dates: `2023-01-15`, `2023/01/15`
- Regular expressions: `re'pattern'`
- Null values: `null`, `NULL`, `None`, `NONE`
- Row number: `LINE_NUMBER`, `ROW_NUMBER`
Operators:
- Comparison: `==`, `!=`, `>`, `>=`, `<`, `<=`
- Boolean: `AND`, `and`, `&&`, `OR`, `or`, `||`
Examples:
```
{lat} > 50 && {depth} != NULL
{species} == 's. pecies' OR {species} == NULL
{value} == re'^[A-Z]+'
LINE_NUMBER > 10
```
Notes:
- All terms and operators must be separated by spaces
- Regular expressions can only be compared with `==` or `!=` against strings
- Values are compared based on their type (string `'5313'` does not equal number `5313`)
The boolean_add_computed_field processor supports math expressions when `math_operation` is `true`.
Operators: `+`, `-`, `*`, `/`, `^` (exponent)
Example:
value: "{field1} + {field2} * 2"
math_operation: true