DataCite maintains an OAI server (https://oai.datacite.org/oai) that serves records for every DOI it has registered, and there has been a lot of community interest in harvesting from it. This makes it possible to harvest metadata from an institution that does not maintain an OAI server of its own, as long as it registers its DOIs with DataCite. What makes this harvesting model especially powerful and flexible is DataCite's concept of a "dynamic OAI set": a harvester is not limited to the pre-defined set of all the records registered by an institution, but can instead harvest virtually any arbitrary subset of them, because any query that the DataCite search API understands can be used as an OAI set. The feature is already in use at IQSS, as a beta version patch.
4
+
5
+
For various reasons, harvesting clients must be created using the `/api/harvest/clients` API in order to take advantage of this feature. Once configured, however, harvests can be run from the Harvesting Clients control panel in the UI.
6
+
7
+
DataCite-harvesting clients must be configured with two feature flags: `useListRecords` (new in v6.6) and `useOaiIdentifiersAsPids` (added in v6.5). Note that these flags may also be of use when harvesting from other sources, not just from DataCite.
8
+
9
+
See "Harvesting from DataCite" under https://guides.dataverse.org/en/latest/api/native-api.html#managing-harvesting-clients for more information.
doc/sphinx-guides/source/admin/harvestclients.rst (7 additions, 1 deletion)
@@ -25,6 +25,12 @@ Please note that in some rare cases this GUI may fail to create a client because
25
25
26
26
Note that as of 5.13, a new entry "Custom HTTP Header" has been added to the Step 1. of Create or Edit form. This optional field can be used to configure this client with a specific HTTP header to be added to every OAI request. This is to accommodate a (rare) use case where the remote server may require a special token of some kind in order to offer some content not available to other clients. Most OAI servers offer the same publicly-available content to all clients, so few admins will have a use for this feature. It is however on the very first, Step 1. screen in case the OAI server requires this token even for the "ListSets" and "ListMetadataFormats" requests, which need to be sent in the Step 2. of creating or editing a client. Multiple headers can be supplied separated by `\\n` - actual "backslash" and "n" characters, not a single "new line" character.
27
27
28
+
Harvesting from DataCite
29
+
~~~~~~~~~~~~~~~~~~~~~~~~
30
+
31
+
As of v6.6, it is possible to harvest metadata directly from DataCite. Their OAI gateway (https://oai.datacite.org/oai) serves records for every DOI they have registered, so metadata can be harvested from any participating institution even if it does not maintain an OAI server of its own. Their OAI implementation also offers the concept of a "dynamic set", which makes it possible to use any query supported by the DataCite search API as though it were a "set". This makes harvesting from DataCite especially flexible: virtually any arbitrary subset of metadata records can be harvested, potentially spanning multiple institutions and registration authorities.
32
+
33
+
For various reasons, harvesting clients must be created via the ``/api/harvest/clients`` API in order to take advantage of this feature. Once configured, however, harvests can be run from the Harvesting Clients control panel in the UI. See the :ref:`managing-harvesting-clients-api` section of the :doc:`/api/native-api` guide for more information.
28
34
29
35
How to Stop a Harvesting Run in Progress
30
36
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -56,4 +62,4 @@ Harvesting Client Changelog
56
62
Harvesting Non-OAI-PMH
57
63
~~~~~~~~~~~~~~~~~~~~~~
58
64
59
-
`DOI2PMH <https://github.com/IQSS/doi2pmh-server>`__ is a community-driven project intended to allow OAI-PMH harvesting from non-OAI-PMH sources.
65
+
`DOI2PMH <https://github.com/IQSS/doi2pmh-server>`__ is a community-driven project intended to allow OAI-PMH harvesting from non-OAI-PMH sources.
doc/sphinx-guides/source/api/native-api.rst (83 additions, 15 deletions)
@@ -5706,7 +5706,8 @@ The output will look something like the following.
5706
5706
"nickName": "myClient",
5707
5707
"sourceName": "",
5708
5708
"set": "fooSet",
5709
-
"useOaiIdentifiersAsPids": false
5709
+
"useOaiIdentifiersAsPids": false,
5710
+
"useListRecords": false,
5710
5711
"schedule": "none",
5711
5712
"status": "inActive",
5712
5713
"lastHarvest": "Thu Oct 13 14:48:57 EDT 2022",
@@ -5725,23 +5726,25 @@ Create a Harvesting Client
5725
5726
5726
5727
To create a harvesting client you must supply a JSON file that describes the configuration, similarly to the output of the GET API above. The following fields are mandatory:
5727
5728
5728
-
- dataverseAlias: The alias of an existing collection where harvested datasets will be deposited
5729
-
- harvestUrl: The URL of the remote OAI archive
5730
-
- archiveUrl: The URL of the remote archive that will be used in the redirect links pointing back to the archival locations of the harvested records. It may or may not be on the same server as the harvestUrl above. If this OAI archive is another Dataverse installation, it will be the same URL as harvestUrl minus the "/oai". For example: https://demo.dataverse.org/ vs. https://demo.dataverse.org/oai
5731
-
- metadataFormat: A supported metadata format. As of writing this the supported formats are "oai_dc", "oai_ddi" and "dataverse_json".
5729
+
- ``dataverseAlias``: The alias of an existing collection where harvested datasets will be deposited
5730
+
- ``harvestUrl``: The URL of the remote OAI archive
5731
+
- ``archiveUrl``: The URL of the remote archive that will be used in the redirect links pointing back to the archival locations of the harvested records. It may or may not be on the same server as the harvestUrl above. If this OAI archive is another Dataverse installation, it will be the same URL as harvestUrl minus the "/oai". For example: https://demo.dataverse.org/ vs. https://demo.dataverse.org/oai
5732
+
- ``metadataFormat``: A supported metadata format. As of writing this the supported formats are "oai_dc", "oai_ddi" and "dataverse_json".
5732
5733
5733
5734
The following optional fields are supported:
5734
5735
5735
-
- sourceName: When ``index-harvested-metadata-source`` is enabled (see :ref:`feature-flags`), sourceName will override the nickname in the Metadata Source facet. It can be used to group the content from many harvesting clients under the same name.
5736
-
- archiveDescription: What the name suggests. If not supplied, will default to "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data."
5737
-
- set: The OAI set on the remote server. If not supplied, will default to none, i.e., "harvest everything".
5738
-
- style: Defaults to "default" - a generic OAI archive. (Make sure to use "dataverse" when configuring harvesting from another Dataverse installation).
5739
-
- customHeaders: This can be used to configure this client with a specific HTTP header that will be added to every OAI request. This is to accommodate a use case where the remote server requires this header to supply some form of a token in order to offer some content not available to other clients. See the example below. Multiple headers can be supplied separated by `\\n` - actual "backslash" and "n" characters, not a single "new line" character.
5740
-
- allowHarvestingMissingCVV: Flag to allow datasets to be harvested with Controlled Vocabulary Values that existed in the originating Dataverse Project but are not in the harvesting Dataverse Project. (Default is false). Currently only settable using API.
5741
-
- useOaiIdentifiersAsPids: Defaults to false; if set to true, the harvester will attempt to use the identifier from the OAI-PMH record header as the **first choice** for the persistent id of the harvested dataset. When set to false, Dataverse will still attempt to use this identifier, but only if none of the `<dc:identifier>` entries in the OAI_DC record contain a valid persistent id (this is new as of v6.5).
5742
-
5743
-
Generally, the API will accept the output of the GET version of the API for an existing client as valid input, but some fields will be ignored. For example, as of writing this there is no way to configure a harvesting schedule via this API.
5744
-
5736
+
- ``sourceName``: When ``index-harvested-metadata-source`` is enabled (see :ref:`feature-flags`), sourceName will override the nickname in the Metadata Source facet. It can be used to group the content from many harvesting clients under the same name.
5737
+
- ``archiveDescription``: What the name suggests. If not supplied, will default to "This Dataset is harvested from our partners. Clicking the link will take you directly to the archival source of the data."
5738
+
- ``set``: The OAI set on the remote server. If not supplied, will default to none, i.e., "harvest everything". (See the "Harvesting from DataCite" section below on using sets when harvesting from DataCite; this is new as of v6.6.)
5739
+
- ``style``: Defaults to "default" - a generic OAI archive. (Make sure to use "dataverse" when configuring harvesting from another Dataverse installation).
5740
+
- ``schedule``: Defaults to "none" (not scheduled). Two formats are supported, for weekly- and daily-scheduled harvests; examples: ``Weekly, Sat 5 AM``; ``Daily, 11 PM``. Note that if a schedule definition is not formatted exactly as described here, it will be ignored silently and the client will be left unscheduled.
5741
+
- ``customHeaders``: This can be used to configure this client with a specific HTTP header that will be added to every OAI request. This is to accommodate a use case where the remote server requires this header to supply some form of a token in order to offer some content not available to other clients. See the example below. Multiple headers can be supplied separated by `\\n` - actual "backslash" and "n" characters, not a single "new line" character.
5742
+
- ``allowHarvestingMissingCVV``: Flag to allow datasets to be harvested with Controlled Vocabulary Values that existed in the originating Dataverse Project but are not in the harvesting Dataverse Project. (Default is false). Currently only settable using API.
5743
+
- ``useOaiIdentifiersAsPids``: Defaults to false; if set to true, the harvester will attempt to use the identifier from the OAI-PMH record header as the **first choice** for the persistent id of the harvested dataset. When set to false, Dataverse will still attempt to use this identifier, but only if none of the ``<dc:identifier>`` entries in the OAI_DC record contain a valid persistent id (this is new as of v6.5).
5744
+
- ``useListRecords``: Defaults to false; if set to true, the harvester will attempt to retrieve multiple records in a single pass using the OAI-PMH verb ListRecords. By default, our harvester relies on a ListIdentifiers call followed by a separate GetRecord call for each individual record. Note that this option is required when configuring harvesting from DataCite (this is new as of v6.6).
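
To make the difference between the two modes concrete, the sketch below prints the OAI-PMH request URLs that each mode corresponds to. The verbs and the ``metadataPrefix``, ``set`` and ``identifier`` arguments come from the OAI-PMH protocol; the function names and the example record identifier are illustrative assumptions, and this is not the harvester's actual code.

```shell
#!/bin/sh
# Sketch only: prints OAI-PMH request URLs; it does not contact any server.

BASE_URL="https://demo.dataverse.org/oai"   # example harvestUrl from the field list above
FORMAT="oai_dc"
SET="fooSet"                                # set name from the sample GET output

# Default mode, step 1: list the record identifiers in the set.
list_identifiers_url() {
    printf '%s?verb=ListIdentifiers&metadataPrefix=%s&set=%s\n' "$BASE_URL" "$FORMAT" "$SET"
}

# Default mode, step 2: one GetRecord call per identifier returned by step 1.
get_record_url() {
    printf '%s?verb=GetRecord&metadataPrefix=%s&identifier=%s\n' "$BASE_URL" "$FORMAT" "$1"
}

# useListRecords=true: a single verb returns the records themselves.
list_records_url() {
    printf '%s?verb=ListRecords&metadataPrefix=%s&set=%s\n' "$BASE_URL" "$FORMAT" "$SET"
}

list_identifiers_url
get_record_url "oai:example:1"   # hypothetical record identifier
list_records_url
```

With N records in a set, the default mode issues on the order of N GetRecord requests after the initial listing, which is why a bulk verb is attractive for large remote archives.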
5745
+
5746
+
Generally, the API will accept the output of the GET version of the API for an existing client as valid input, but some fields will be ignored.
5747
+
5745
5748
You can download this :download:`harvesting-client.json <../_static/api/harvesting-client.json>` file to use as a starting point.
@@ -5819,6 +5822,71 @@ The fully expanded example above (without the environment variables) looks like
5819
5822
5820
5823
Only users with superuser permissions may delete harvesting clients.
5821
5824
5825
+
Harvesting from DataCite
5826
+
~~~~~~~~~~~~~~~~~~~~~~~~
5827
+
5828
+
The following two options are **required** when harvesting from DataCite (https://oai.datacite.org/oai):
5829
+
5830
+
.. code-block:: bash
5831
+
5832
+
"useOaiIdentifiersAsPids": true,
5833
+
"useListRecords": true,
5834
+
5835
+
There are two ways the ``set`` parameter can be used when harvesting from DataCite:
5836
+
5837
+
- DataCite maintains pre-configured OAI sets for every subscribing institution that registers DOIs with them. This can be used to harvest the entire set of metadata registered by this organization or school, etc. (this is identical to how the set parameter is used with any other standard OAI archive);
5838
+
- As a unique, proprietary DataCite feature, it can be used to harvest virtually any arbitrary subset of records (potentially spanning different institutions and authorities, etc.). Any query that the DataCite search API understands can be used as an OAI set name. For example, the search query ``doi:10.7910/DVN/TJCLKP`` finds one specific dataset.

To create a single-record OAI set from that query, use its base64-encoded form as the set name:
5845
+
5846
+
.. code-block:: bash
5847
+
5848
+
echo "doi:10.7910/DVN/TJCLKP" | base64
5849
+
ZG9pOjEwLjc5MTAvRFZOL1RKQ0xLUAo=
5850
+
5851
+
Use the encoded string above, prefixed by the ``~`` character, in your harvesting client configuration:
5852
+
5853
+
.. code-block:: bash
5854
+
5855
+
"set": "~ZG9pOjEwLjc5MTAvRFZOL1RKQ0xLUAo="
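
The encoding and prefixing steps can be combined into a small helper. This is a sketch, assuming a POSIX shell and a ``base64`` utility on the PATH; the function name ``datacite_set_name`` is made up for illustration. Note that ``echo`` appends a newline before encoding, and the helper reproduces that so its output matches the value shown in this section.

```shell
#!/bin/sh
# Sketch: turn a DataCite search query into a dynamic OAI set name.
# datacite_set_name is a hypothetical helper, not part of Dataverse.

datacite_set_name() {
    # printf '%s\n' mirrors `echo "<query>" | base64`, trailing newline included.
    # tr removes the line wrapping that some base64 implementations add.
    encoded=$(printf '%s\n' "$1" | base64 | tr -d '\n')
    printf '~%s\n' "$encoded"
}

datacite_set_name "doi:10.7910/DVN/TJCLKP"
# prints ~ZG9pOjEwLjc5MTAvRFZOL1RKQ0xLUAo=
```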
5856
+
5857
+
The following configuration will create a client that will harvest the IQSS dataset specified above on a weekly schedule:
5858
+
5859
+
.. code-block:: json
5860
+
5861
+
{
5862
+
"useOaiIdentifiersAsPids": true,
5863
+
"useListRecords": true,
5864
+
"set": "~ZG9pOjEwLjc5MTAvRFZOL1RKQ0xLUAo=",
5865
+
"nickName": "iqssTJCLKP",
5866
+
"dataverseAlias": "harvestedCollection",
5867
+
"type": "oai",
5868
+
"style": "default",
5869
+
"harvestUrl": "https://oai.datacite.org/oai",
5870
+
"archiveUrl": "https://oai.datacite.org",
5871
+
"archiveDescription": "The metadata for this IQSS Dataset was harvested from DataCite. Clicking the dataset link will take you directly to the original archival location, as registered with DataCite.",
5872
+
"schedule": "Weekly, Tue 4 AM",
5873
+
"metadataFormat": "oai_dc"
5874
+
}
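
As a sketch of how such a configuration might be submitted, the script below writes the JSON to a file and prints the ``curl`` command that would send it to the ``/api/harvest/clients`` API. The ``SERVER_URL`` and ``API_TOKEN`` placeholders, the file name, and the exact shape of the request (a POST to ``/api/harvest/clients/<nickName>`` with an ``X-Dataverse-key`` header) are assumptions here; the "Create a Harvesting Client" section is the authoritative reference. The command is only printed, not executed.

```shell
#!/bin/sh
# Sketch: write the client configuration to a file and print (not run) the
# request that would create the client. All names below are placeholders.

SERVER_URL="${SERVER_URL:-http://localhost:8080}"
API_TOKEN="${API_TOKEN:-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}"
NICKNAME="iqssTJCLKP"
CONFIG_FILE="${CONFIG_FILE:-datacite-client.json}"

# Same fields as the configuration above; archiveDescription is left out,
# so the default description would apply.
cat > "$CONFIG_FILE" <<EOF
{
  "useOaiIdentifiersAsPids": true,
  "useListRecords": true,
  "set": "~ZG9pOjEwLjc5MTAvRFZOL1RKQ0xLUAo=",
  "nickName": "$NICKNAME",
  "dataverseAlias": "harvestedCollection",
  "type": "oai",
  "style": "default",
  "harvestUrl": "https://oai.datacite.org/oai",
  "archiveUrl": "https://oai.datacite.org",
  "schedule": "Weekly, Tue 4 AM",
  "metadataFormat": "oai_dc"
}
EOF

# Assumed request shape; printed so it can be reviewed before running it.
create_command() {
    printf 'curl -H "X-Dataverse-key:%s" -X POST -H "Content-Type: application/json" "%s/api/harvest/clients/%s" --upload-file "%s"\n' \
        "$API_TOKEN" "$SERVER_URL" "$NICKNAME" "$CONFIG_FILE"
}

create_command
```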
5875
+
5876
+
The queries can be as complex and/or long as necessary, with sub-queries combined via logical ANDs and ORs. Please keep in mind that white spaces must be encoded as ``%20``. For example, the following query:
5877
+
5878
+
.. code-block:: bash
5879
+
5880
+
prefix:10.17603 AND (types.resourceType:Report* OR types.resourceType:Mission*)
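
Following the same pattern as the single-DOI example, a query like this could be turned into a set name by replacing each space with ``%20`` and then base64-encoding the result. The sketch below assumes a POSIX shell with ``sed`` and ``base64``; the helper name is hypothetical, and the ``~`` prefix and trailing newline follow the earlier example rather than anything stated for this particular query.

```shell
#!/bin/sh
# Sketch: encode spaces as %20, then build the "~<base64>" set name.
# encode_query_as_set is a hypothetical helper, not part of Dataverse.

encode_query_as_set() {
    # Step 1: replace every space with %20, as required for DataCite queries.
    escaped=$(printf '%s\n' "$1" | sed 's/ /%20/g')
    # Step 2: base64-encode (newline included, as `echo ... | base64` would)
    # and strip any line wrapping added by the base64 tool.
    encoded=$(printf '%s\n' "$escaped" | base64 | tr -d '\n')
    printf '~%s\n' "$encoded"
}

QUERY='prefix:10.17603 AND (types.resourceType:Report* OR types.resourceType:Mission*)'
encode_query_as_set "$QUERY"
```

The printed value can then be used as the ``set`` field of a harvesting client configuration.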
<description>OAI-PMH data provider implementation. Use it to build an OAI-PMH endpoint, providing your data records as harvestable resources.</description>